Abstract
The maximum entropy principle introduced by Jaynes proposes that a data distribution should maximize the entropy subject to constraints imposed by the available knowledge. Jaynes provided a solution for the case when constraints were imposed on the expected value of a set of scalar functions of the data. These expected values are typically moments of the distribution. This paper describes how the method of maximum entropy PDF projection can be used to generalize the maximum entropy principle to constraints on the joint distribution of this set of functions.
1. Introduction
1.1. Jaynes’ Maximum Entropy Principle
The estimation of probability density functions (PDFs) is the cornerstone of classical decision theory as applied to real-world problems. The maximum entropy principle of Jaynes [1] proposes that the PDF should have maximum entropy subject to constraints imposed by the knowledge one has about the density. Let $\mathbf{x} = \{x_1, x_2, \ldots, x_N\}$ be a set of N random variables with joint PDF $p(\mathbf{x})$. The entropy of the distribution is given by
$$H\{p\} = -\int p(\mathbf{x}) \log p(\mathbf{x}) \, d\mathbf{x}. \qquad (1)$$
Jaynes worked out the case when the knowledge about $\mathbf{x}$ consists of the expected values of a set of K measurements. More precisely, he considered K scalar functions $f_k(\mathbf{x})$, $1 \le k \le K$, and constrained their expected values:
$$E\{f_k(\mathbf{x})\} = \bar{f}_k, \quad 1 \le k \le K. \qquad (2)$$
If the functions $f_k$ are powers of the data, then (2) are moment constraints.
1.2. Feature Distribution Constraints
Jaynes’ results had initial applications in statistical mechanics and thermodynamics [2], and have since found applications in a wide range of disciplines [2,3,4,5]. However, we would like to extend the results by replacing constraints (2) with constraints that are more meaningful in real-world inference problems. Instead of knowing just the average values of the measurements, suppose we knew the joint distribution of the feature vector $\mathbf{z} = T(\mathbf{x}) = [f_1(\mathbf{x}), \ldots, f_K(\mathbf{x})]$, denoted by $g(\mathbf{z})$. This carries more information than the average value of each measurement $f_k(\mathbf{x})$. Because the number of measurements K is small compared with the dimension N of $\mathbf{x}$, it is feasible to estimate $g(\mathbf{z})$ from a set of training samples using kernel-based PDF estimation methods, for example. This constraint is more general and can be adapted to produce something similar to Jaynes’ constraints (2) if the marginal measurement distributions are assumed independent and Gaussian with means $\bar{f}_k$. This has immediate applications in a wide range of fields, for example in speech analysis and recognition, where $\mathbf{z}$ could be the MEL frequency cepstrum coefficients (MFCC) [6] extracted from the time-series data $\mathbf{x}$, or in neural networks, where $\mathbf{z}$ could be the output of a network.
Note that the distribution $g(\mathbf{z})$ can be obtained from $p(\mathbf{x})$ by marginalization:
$$g(\mathbf{z}) = \int_{\mathcal{M}(\mathbf{z})} p(\mathbf{x}) \, d\mathbf{x}, \qquad (3)$$
where the integral is carried out on the level set or manifold given by
$$\mathcal{M}(\mathbf{z}) = \{\mathbf{x} : T(\mathbf{x}) = \mathbf{z}\}. \qquad (4)$$
The constraint problem can then be re-stated as follows:
Problem 1.
Given a known feature distribution $g(\mathbf{z})$, maximize the entropy of $p(\mathbf{x})$ subject to
$$\int_{\mathcal{M}(\mathbf{z})} p(\mathbf{x}) \, d\mathbf{x} = g(\mathbf{z}) \quad \text{for all } \mathbf{z}. \qquad (5)$$
The solution to this problem is called maximum entropy PDF projection [7,8,9].
1.3. Significance
The main significance of maximum entropy PDF projection is the de facto creation of a statistical model through the extraction of features. Once a feature extraction meeting some mild requirements (given below) has been identified, a statistical model has been determined. This has a number of advantages, not the least of which is that the “art” of extracting features, i.e., signal processing, is well established, and many good methods exist to extract meaningful information from data. For example, the extraction of MFCC features for processing speech signals has been developed to approximate human hearing [6], and, therefore, with maximum entropy PDF projection, should lead to statistical data models that share some qualities with human perception. Before maximum entropy PDF projection, feature extraction methods had to be compared based on secondary factors such as classification results. Maximum entropy PDF projection allows a feature extraction method to be evaluated based on its corresponding statistical model.
The use of the maximum entropy principle assures the fairest means of comparing two statistical models derived from competing feature extraction methods. In most real-world applications, we cannot know $p(\mathbf{x})$, and must be satisfied with estimating it from some training data. Suppose that we have a set of K training samples $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_K$, and have a number of proposed PDFs computed using (6) for various feature transformations. Let these projected PDFs be denoted by $G_1(\mathbf{x}), G_2(\mathbf{x}), \ldots$. We would like to determine which projected PDF (i.e., which feature vector) provides a “better” fit to the data. One approach would be to compare the PDFs based on the average log-likelihood
$$L_i = \frac{1}{K} \sum_{k=1}^{K} \log G_i(\mathbf{x}_k),$$
choosing the feature transformation that results in the largest value. However, likelihood comparison by itself is misleading, so one must also consider the entropy $Q_i$ of the distribution, which is the negative of the theoretical (expected) value of $\log G_i(\mathbf{x})$. Distributions that spread the probability mass over a wider area have higher entropy since the average value of $\log G_i(\mathbf{x})$ is lower. The two concepts of Q and L are compared in Figure 1, in which we show three competing distributions, $G_1$, $G_2$, and $G_3$. The vertical lines represent the locations of the K training samples. If $L_i$ is the average value of $\log G_i$ at the training sample locations, then clearly $L_1 > L_2$ and $L_1 > L_3$. However, choosing $G_1$ is very risky because it is over-adapted to the training samples. Clearly, $G_1$ has lower entropy since most of the probability mass is at places with higher likelihood. Therefore, it has achieved higher L at the cost of lower Q, a suspicious situation. On the other hand, $L_2 > L_3$, but $Q_2 \ge Q_3$. Therefore, $G_2$ has achieved higher L than $G_3$ without suffering lower Q, so choosing $G_2$ over $G_3$ is not risky. If we always choose among models that have maximum possible entropy for the given choice of features, we are likely to obtain better features and better generative models.
Figure 1.
Comparison of entropy Q and average log-likelihood L for three distributions. The vertical lines are the locations of training samples.
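The trade-off between L and Q can be made concrete with a small numerical sketch (purely illustrative, with arbitrary toy data, and not taken from the paper): a narrow kernel density estimate placed on the training samples achieves a higher average log-likelihood L than a moment-matched Gaussian, but only by giving up entropy Q.

```python
# Illustrative sketch (toy data, assumed bandwidths): average log-likelihood L versus
# differential entropy Q for an over-adapted density and a moment-matched Gaussian.
import numpy as np
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(0)
x_train = rng.normal(size=10)                 # hypothetical training samples
grid = np.linspace(-6.0, 6.0, 4001)           # grid for numerical entropy

def L_and_Q(pdf):
    L = np.mean(np.log(pdf(x_train)))                          # average training log-likelihood
    p = pdf(grid)
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    Q = -np.sum(plogp) * (grid[1] - grid[0])                    # entropy by simple quadrature
    return L, Q

fit = norm(loc=x_train.mean(), scale=x_train.std())             # moment-matched Gaussian
kde = gaussian_kde(x_train, bw_method=0.05)                     # narrow kernels at the samples

for name, pdf in [("Gaussian fit", fit.pdf), ("narrow KDE ", kde)]:
    L, Q = L_and_Q(pdf)
    print(f"{name}: L = {L:7.3f}, Q = {Q:7.3f}")   # KDE: higher L, lower Q (suspicious)
```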
2. Main Results
2.1. MaxEnt PDF Projection
The solution to Problem 1 is based on PDF projection [10]. In PDF projection, one is given a feature distribution $g(\mathbf{z})$, where $\mathbf{z} = T(\mathbf{x})$, and constructs a PDF on the input data as follows:
$$G(\mathbf{x}) = \frac{p(\mathbf{x}|H_0)}{p(\mathbf{z}|H_0)} \, g(\mathbf{z}), \qquad \mathbf{z} = T(\mathbf{x}), \qquad (6)$$
where $p(\mathbf{x}|H_0)$ is a reference distribution meeting some mild constraints [10], and $p(\mathbf{z}|H_0)$ is the corresponding distribution imposed by $p(\mathbf{x}|H_0)$ on the measurements, i.e., Equation (3) applied to $p(\mathbf{x}|H_0)$. It can be shown that (a) $G(\mathbf{x})$ integrates to one and is therefore a valid PDF; (b) if $\mathbf{x}$ is drawn from $G(\mathbf{x})$, then $\mathbf{z} = T(\mathbf{x})$ is distributed according to $g(\mathbf{z})$, so $G(\mathbf{x})$ meets the constraint of Problem 1; and (c) the entropy of $G(\mathbf{x})$ depends on the choice of the reference distribution $p(\mathbf{x}|H_0)$.
The last item indicates that, to solve Problem 1, it is only necessary to select the reference distribution $p(\mathbf{x}|H_0)$ for maximum entropy (MaxEnt).
To understand the solution to this problem, it is useful to consider the sampling procedure for (6). To sample from distribution (6), one draws a feature sample $\mathbf{z}$ from the PDF $g(\mathbf{z})$; then, $\mathbf{x}$ is drawn from the set $\mathcal{M}(\mathbf{z})$, defined in (4). Note, however, that to conform to (6), it is necessary to draw the sample $\mathbf{x}$ from $\mathcal{M}(\mathbf{z})$ with probability proportional to the value of $p(\mathbf{x}|H_0)$. The distribution of $\mathbf{x}$ on the manifold may be thought of as the conditional distribution $p(\mathbf{x}|\mathbf{z}, H_0)$, and it is proportional to $p(\mathbf{x}|H_0)$. It is in fact
$$p(\mathbf{x}|\mathbf{z}, H_0) = \frac{p(\mathbf{x}|H_0)}{p(\mathbf{z}|H_0)}, \qquad \mathbf{x} \in \mathcal{M}(\mathbf{z}). \qquad (7)$$
It can be verified that (7) integrates to 1 on the manifold $\mathcal{M}(\mathbf{z})$. The entropy of (6) can be decomposed as the entropy of $g(\mathbf{z})$ plus the expected value, over $g(\mathbf{z})$, of the entropy of the conditional distribution (7) on the manifold (see Equation (8) in [8]).
Maximizing this quantity seems daunting, but there is one condition under which the conditional distribution (7) has maximum entropy for all $\mathbf{z}$, and that is when it is the uniform distribution on $\mathcal{M}(\mathbf{z})$ for every $\mathbf{z}$. This, in turn, is achieved when $p(\mathbf{x}|H_0)$ has a constant value on any manifold $\mathcal{M}(\mathbf{z})$.
This process of selecting the reference distribution for maximum entropy is called maximum entropy PDF projection [8,9]. The maximizing reference distribution is written $p(\mathbf{x}|H_0^*)$, and the MaxEnt distribution is written
$$G^*(\mathbf{x}) = \frac{p(\mathbf{x}|H_0^*)}{p(\mathbf{z}|H_0^*)} \, g(\mathbf{z}), \qquad \mathbf{z} = T(\mathbf{x}), \qquad (8)$$
which is the unique distribution that solves Problem 1.
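As a minimal illustration of how (8) is evaluated in practice, the following sketch assembles the three log-domain terms for the scalar energy feature $z = \|\mathbf{x}\|_2^2$ under a standard Gaussian reference (this case is developed in Section 3.1); the data dimension and the gamma feature density $g(z)$ are arbitrary choices made only for the example.

```python
# Minimal sketch (toy setup, not code from the paper): the MaxEnt projected log-PDF
#   log G(x) = log p(x|H0) - log p(z|H0) + log g(z)
# for the scalar energy feature z = ||x||^2 with a standard Gaussian reference.
import numpy as np
from scipy.stats import chi2, gamma

N = 64                                   # assumed data dimension
g_z = gamma(a=40.0, scale=2.0)           # assumed (learned) feature distribution g(z)

def log_G(x):
    z = np.sum(x ** 2)                                     # energy feature z = T(x)
    log_px_H0 = -0.5 * N * np.log(2.0 * np.pi) - 0.5 * z   # standard normal reference p(x|H0)
    log_pz_H0 = chi2.logpdf(z, df=N)                       # induced feature PDF p(z|H0)
    return log_px_H0 - log_pz_H0 + g_z.logpdf(z)

x = 1.2 * np.random.default_rng(0).normal(size=N)          # an arbitrary test input
print(log_G(x))
```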
In order that it is possible to select $p(\mathbf{x}|H_0)$ for MaxEnt, the feature transformation $T(\mathbf{x})$ must be such that the uniform distribution can be defined on $\mathcal{M}(\mathbf{z})$ for any $\mathbf{z}$. Thus, $\mathcal{M}(\mathbf{z})$ must be bounded and integrable. This condition is easily met if the feature contains information about the size of $\mathbf{x}$, so that when $\mathbf{z}$ is fixed to a finite value, $\mathbf{x}$ has a fixed norm. To state this formally, let there exist a function $h$ such that $\|\mathbf{x}\| = h(T(\mathbf{x}))$ for some valid norm $\|\cdot\|$ on the range of $\mathbf{x}$.
Once this condition is met, the MaxEnt reference distribution $p(\mathbf{x}|H_0^*)$ is any distribution that is constant on any level set $\mathcal{M}(\mathbf{z})$. This happens if there exists a function $c$ such that
$$p(\mathbf{x}|H_0^*) = c\left(T(\mathbf{x})\right).$$
Interestingly, any reference distribution meeting these constraints results in the same distribution (6) [8]. This means that, although $p(\mathbf{x}|H_0^*)$ is not unique, $G^*(\mathbf{x})$ is unique—it must be unique if it is the maximum entropy PDF.
The above conditions can be easily met by inserting an energy statistic into the feature set and defining a reference distribution that depends on $\mathbf{x}$ only through this energy statistic. The energy statistic is a scalar statistic from which it is possible to compute a valid norm on the range of $\mathbf{x}$, denoted by $\mathbb{X}$. In summary, the simplest way to solve for the MaxEnt projected PDF, given the range $\mathbb{X}$ of $\mathbf{x}$, involves these three steps:
- Identify a norm $\|\mathbf{x}\|$ valid in $\mathbb{X}$. A norm must meet the properties of absolute scalability, $\|a\mathbf{x}\| = |a| \, \|\mathbf{x}\|$, the triangle inequality, $\|\mathbf{x} + \mathbf{y}\| \le \|\mathbf{x}\| + \|\mathbf{y}\|$, and positive definiteness, $\|\mathbf{x}\| = 0$ only if $\mathbf{x} = 0$.
- Identify a scalar statistic (energy statistic), computable from the features $\mathbf{z}$, from which it is possible to compute $\|\mathbf{x}\|$.
- Use a reference hypothesis depending on $\mathbf{x}$ only through this energy statistic.
The data generation process for MaxEnt PDF projection, corresponding to distribution (8), does not depend on the particular choice of $H_0^*$ and is the following:
- From the known feature distribution $g(\mathbf{z})$, draw a sample denoted by $\mathbf{z}^*$.
- Identify the set of all samples mapping to $\mathbf{z}^*$, denoted by $\mathcal{M}(\mathbf{z}^*)$ and defined in (4).
- Draw a sample $\mathbf{x}$ from this set, uniformly, so that no member of $\mathcal{M}(\mathbf{z}^*)$ is more likely to be chosen than another.
The maximum entropy nature of the solution can be recognized in the uniform sampling on the level set $\mathcal{M}(\mathbf{z}^*)$. The last item above is called uniform manifold sampling (UMS) [9]. The data generation process for three important cases of the data range is provided in Section 3.1, Section 3.2 and Section 3.3.
3. Examples
The implementation of MaxEnt PDF projection depends strongly on the range of the input data $\mathbf{x}$, denoted by $\mathbb{X}$. In this section, examples are provided for three important cases of $\mathbb{X}$.
3.1. Unbounded Data
Let $\mathbf{x}$ range everywhere in $\mathbb{R}^N$. The 2-norm is valid in $\mathbb{R}^N$ and can be computed from the total energy
$$e = \sum_{i=1}^{N} x_i^2 = \|\mathbf{x}\|_2^2.$$
The Gaussian reference hypothesis can be written in terms of $e$:
$$p(\mathbf{x}|H_0) = (2\pi)^{-N/2} \exp\left(-\frac{e}{2}\right), \qquad (9)$$
so naturally $p(\mathbf{x}|H_0)$ will have a constant value on any manifold $\mathcal{M}(\mathbf{z})$. It is not necessary to include $e$ explicitly in the feature set—it is only necessary that the 2-norm can be computed from $\mathbf{z}$.
The distribution $p(\mathbf{z}|H_0)$ can be determined in closed form for some feature transformations [11,12]. For others, the moment-generating function can be written in closed form, which allows the saddle point approximation to be used to compute $p(\mathbf{z}|H_0)$ [11]. More on this is presented in Section 4.1.
An important case where a closed-form solution exists is a linear transformation combined with the total energy, $\mathbf{z} = \left[\mathbf{A}^\top \mathbf{x}, \; \sum_{i=1}^{N} x_i^2\right]$.
This case is covered in detail in ([8], Section IV.C, p. 2821), and in ([9], Section III.B, p. 2459).
The following simple example demonstrates the main points of this case. Assume input data dimension $N = 3$ and a feature transformation consisting of the sample mean and sample variance:
$$\mathbf{z} = T(\mathbf{x}) = [m, v],$$
where
$$m = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad v = \frac{1}{N} \sum_{i=1}^{N} (x_i - m)^2.$$
Note that the total energy can be computed from $\mathbf{z}$,
$$\sum_{i=1}^{N} x_i^2 = N(v + m^2),$$
which satisfies the requirement that the 2-norm of $\mathbf{x}$ can be computed from $\mathbf{z}$.
Under the assumption that $\mathbf{x}$ is distributed according to the standard Normal distribution (9), $m$ will have mean 0 and variance $1/N$, and $v$ will have a chi-square distribution with $N-1$ degrees of freedom and scaling $1/N$, that is, $Nv \sim \chi^2_{N-1}$. Furthermore, $m$ and $v$ are statistically independent. Therefore, $p(\mathbf{z}|H_0) = p(m|H_0) \, p(v|H_0)$. For the given feature distribution $g(\mathbf{z})$, we assume the components of $\mathbf{z}$ are independent and Gaussian with given means and variances. The MaxEnt projected PDF, given by (8), is plotted on the left of Figure 2 for a slice of $\mathbb{R}^3$ at a fixed value of $x_3$. The density values shown in the figure, summed over all three axes and properly scaled, added to a value of 0.9999999998, which validates by numerical integration that (8) is a density. Notice that the probability is concentrated on a circular region. This can be understood in terms of the sampling procedure given below.
Figure 2.
(Left) illustration of the projected PDF on a slice of $\mathbb{R}^3$ at a fixed value of $x_3$; (Right) samples drawn from the sampling procedure (see text).
To sample from (8), we first draw a sample of $\mathbf{z}$ from $g(\mathbf{z})$, denoted by $\mathbf{z}^* = [m^*, v^*]$, which provides values for the sample mean and variance. Then, $\mathbf{x}$ must be drawn uniformly from the manifold $\mathcal{M}(\mathbf{z}^*)$, which is defined by the conditions on the sample mean and variance. This is easily accomplished if we note that the sample mean condition is met for any $\mathbf{x}$ of the form
$$\mathbf{x} = m^* \mathbf{1} + \mathbf{Q} \mathbf{u}, \qquad (10)$$
where $\mathbf{1}$ is the length-N vector of ones and $\mathbf{Q}$ is the $N \times (N-1)$ ortho-normal matrix spanning the space orthogonal to $\mathbf{1}$. To meet the second (variance) condition, it is necessary that
$$\|\mathbf{u}\|^2 = N v^*.$$
This condition defines a hypersphere in $N-1$ dimensions, which explains the circular region in Figure 2. This hypersphere is sampled uniformly by drawing $N-1$ independent Gaussian random variables, collected in the vector $\mathbf{u}$, then scaling so that $\|\mathbf{u}\|^2 = N v^*$. Then, $\mathbf{x}$ is constructed using (10). Samples drawn in this manner are shown on the right side of Figure 2. To agree with the left side of the figure, only samples with $x_3$ near the slice value are plotted.
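The sampling procedure just described is easy to code. The sketch below is a hedged illustration: the dimension N = 3 follows the example, but the means and standard deviations assumed for the Gaussian feature distribution are arbitrary choices, not the values used for Figure 2.

```python
# Hedged sketch: uniform manifold sampling for the sample-mean/sample-variance feature.
import numpy as np

rng = np.random.default_rng(0)
N = 3
mu = np.array([0.0, 1.0])      # assumed means of (sample mean, sample variance)
sig = np.array([0.3, 0.3])     # assumed standard deviations

# Orthonormal basis Q (N x (N-1)) of the subspace orthogonal to the ones vector.
Qfull, _ = np.linalg.qr(np.ones((N, 1)), mode='complete')
Q = Qfull[:, 1:]

samples = []
while len(samples) < 5000:
    m, v = rng.normal(mu, sig)                  # draw z* = (m*, v*) from g(z)
    if v <= 0:
        continue                                # reject the (rare) negative variances
    u = rng.normal(size=N - 1)                  # random direction in N-1 dimensions ...
    u *= np.sqrt(N * v) / np.linalg.norm(u)     # ... scaled so that ||u||^2 = N v*
    samples.append(m + Q @ u)                   # x = m* 1 + Q u, Equation (10)

samples = np.asarray(samples)
print(samples.mean(axis=1)[:3])                 # per-sample means, equal to the drawn m*
print(samples.var(axis=1)[:3])                  # per-sample (1/N) variances, equal to v*
```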
Please see the above-cited references for the treatment of general linear transformations.
3.2. Positive Data
Let $\mathbf{x}$ have positive-valued elements, so $\mathbf{x}$ ranges in the positive quadrant of $\mathbb{R}^N$, denoted by $\mathbb{P}^N$. This holds whenever spectral or intensity data are processed. The appropriate norm in this space is the 1-norm
$$\|\mathbf{x}\|_1 = \sum_{i=1}^{N} x_i.$$
To satisfy the conditions for maximum entropy, it must be possible to compute this statistic from the features. The exponential reference hypothesis can be written in terms of the 1-norm:
$$p(\mathbf{x}|H_0) = \exp\left(-\sum_{i=1}^{N} x_i\right) = \exp\left(-\|\mathbf{x}\|_1\right), \qquad (11)$$
so naturally (11) will have a constant value on any manifold $\mathcal{M}(\mathbf{z})$, and it is the appropriate reference hypothesis for maximum entropy. The inclusion of the 1-norm explicitly in the feature set is only one way to ensure that $\mathcal{M}(\mathbf{z})$ is compact—it is only necessary that the 1-norm can be computed from $\mathbf{z}$.
An important feature extraction is the linear transformation $\mathbf{z} = \mathbf{A}^\top \mathbf{x}$.
Note that it is necessary that the statistic $\|\mathbf{x}\|_1$ can be computed from $\mathbf{z}$, which can be accomplished, for example, by making the first column of $\mathbf{A}$ constant. This case is covered in detail in ([8], Section IV.B, p. 2820), and in ([9], Section IV, p. 2460). Sampling is accomplished by drawing a sample $\mathbf{z}^*$ from $g(\mathbf{z})$ and then drawing a sample uniformly from the set $\mathcal{M}(\mathbf{z}^*)$.
The following simple example demonstrates the main theoretical concepts. We assume a data dimension of $N = 2$ so that the distribution can be visualized as an image. The feature transformation is simply the sum of the samples:
$$z = T(\mathbf{x}) = x_1 + x_2.$$
Under the exponential reference hypothesis, the feature distribution is chi-square with $2N$ degrees of freedom and scaling $1/2$:
$$p(z|H_0) = \frac{z^{N-1} e^{-z}}{(N-1)!}, \qquad z \ge 0.$$
For the given feature distribution, we assume a Gaussian $g(z)$ with a given mean $\mu$ and variance $\sigma^2$. The MaxEnt projected PDF, given by (8), is plotted in Figure 3. The density values shown in the figure, when properly scaled, summed to a value of 0.9998, which validates by numerical integration that (8) is a density. Note that the distribution is concentrated on the line $x_1 + x_2 = \mu$, and is flat on this line, as would be expected for maximum entropy. To sample from this distribution, we first draw a sample $z^*$ from $g(z)$ and then draw a sample on the line given by $x_1 + x_2 = z^*$. This can be done by sampling $x_1$ uniformly in $[0, z^*]$, then letting $x_2 = z^* - x_1$. Samples drawn in this way are shown on the right side of Figure 3.
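A few lines of code make this sampling procedure concrete. This is a hedged sketch: the mean and standard deviation assumed for $g(z)$ are arbitrary, not the values used for Figure 3, and the closing comment indicates one way the idea extends beyond N = 2.

```python
# Hedged sketch: sampling the MaxEnt projected PDF for positive data with z = x1 + x2.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 4.0, 0.5                   # assumed mean and standard deviation of g(z)

def sample(n):
    out = []
    while len(out) < n:
        z = rng.normal(mu, sigma)      # draw the feature sample z* from g(z)
        if z <= 0:
            continue                   # z must be positive for positive-valued data
        x1 = rng.uniform(0.0, z)       # uniform on the segment x1 + x2 = z, x1, x2 >= 0
        out.append((x1, z - x1))
    return np.asarray(out)

xy = sample(10000)
print(xy.min(), xy.sum(axis=1)[:5])    # all elements >= 0; each pair sums to its drawn z*

# For general N, a uniform draw on the simplex {x >= 0, sum(x) = z} can be obtained as
#   x = z * rng.dirichlet(np.ones(N))
```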
Figure 3.
(Left) illustration of the projected PDF; (Right) samples drawn from the sampling procedure (see text).
This example generalizes to higher dimensions and to arbitrary linear transformations $\mathbf{z} = \mathbf{A}^\top \mathbf{x}$ for a full-rank matrix $\mathbf{A}$. In this case, $p(\mathbf{z}|H_0)$ is not chi-square and in fact is not available in closed form. However, the moment-generating function is available in closed form, so the saddle point approximation may be used (see Section IV.A, p. 2245 in [11]). Samples of $\mathbf{x}$ are drawn by drawing a sample $\mathbf{z}^*$ from $g(\mathbf{z})$ and then sampling uniformly in the set $\mathcal{M}(\mathbf{z}^*)$. At high dimension, this requires a form of Gibbs sampling called hit-and-run (see Section IV, p. 2460 in [9]).
3.3. Unit Hypercube, $\mathbb{U}^N$
Let $\mathbf{x}$ have elements limited to the interval $[0, 1]$. This case is common when working with neural networks. The range of $\mathbf{x}$ is then the unit hypercube, denoted by $\mathbb{U}^N$. The uniform reference hypothesis
$$p(\mathbf{x}|H_0) = 1, \qquad \mathbf{x} \in \mathbb{U}^N, \qquad (12)$$
produces maximum entropy. No norm-producing energy statistic is needed. Naturally, (12) will have a constant value on any manifold $\mathcal{M}(\mathbf{z})$.
The following simple example demonstrates the main theoretical concepts. We assume a data dimension of $N = 2$ so that the distribution can be visualized as an image. The feature transformation is simply the sum of the samples:
$$z = T(\mathbf{x}) = x_1 + x_2.$$
For this case, the uniform distribution (12) provides maximum entropy. Under this reference hypothesis, the feature distribution is Irwin-Hall, given by
$$p(z|H_0) = \frac{1}{(N-1)!} \sum_{k=0}^{\lfloor z \rfloor} (-1)^k \binom{N}{k} (z - k)^{N-1}, \qquad 0 \le z \le N,$$
where $\lfloor z \rfloor$ is the integer part of $z$. For $N = 2$, this is the triangular distribution
$$p(z|H_0) = \begin{cases} z, & 0 \le z \le 1, \\ 2 - z, & 1 < z \le 2. \end{cases}$$
For the given feature distribution, we assume a Gaussian $g(z)$ with a given mean $\mu$ and variance $\sigma^2$. The MaxEnt projected PDF, given by (8), is plotted in Figure 4. The density values shown in the figure, when properly scaled, summed to a value of 0.999, which validates by numerical integration that (8) is a density. Note that the distribution is concentrated on the line $x_1 + x_2 = \mu$ and is flat on this line, as would be expected for maximum entropy. To sample from this distribution, we first draw a sample $z^*$ from $g(z)$ and then draw a sample on the line given by $x_1 + x_2 = z^*$. This can be done by finding where this line intercepts the boundary of the unit square and sampling uniformly in the segment between the intercepts. Note that this sampling differs from the previous example as a result of the upper bound at 1.
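The sketch below illustrates this case (hedged: the mean and standard deviation assumed for $g(z)$ are arbitrary, not the values used for Figure 4). It evaluates the projected log-PDF, using the fact that the uniform reference (12) contributes zero in the log domain and that $p(z|H_0)$ is the triangular Irwin-Hall density, and it samples by clipping the line $x_1 + x_2 = z^*$ to the unit square.

```python
# Hedged sketch: projected log-PDF and sampling for the N = 2 unit-hypercube example.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma = 1.2, 0.15                  # assumed mean and standard deviation of g(z)

def log_irwin_hall_n2(z):
    # Irwin-Hall density for N = 2: triangular on [0, 2].
    p = np.where(z < 1.0, z, 2.0 - z)
    return np.log(np.clip(p, 1e-300, None))

def log_G(x):
    # log G(x) = 0 - log p(z|H0) + log g(z), valid for x inside the unit square.
    z = np.sum(x)
    return -log_irwin_hall_n2(z) + norm.logpdf(z, mu, sigma)

def sample(n):
    out = []
    while len(out) < n:
        z = rng.normal(mu, sigma)                    # draw z* from g(z)
        if not 0.0 < z < 2.0:
            continue                                 # z* must lie in the range of the feature
        lo, hi = max(0.0, z - 1.0), min(1.0, z)      # intercepts with the unit square
        x1 = rng.uniform(lo, hi)
        out.append((x1, z - x1))
    return np.asarray(out)

xy = sample(10000)
print(log_G(np.array([0.6, 0.6])), xy.min(), xy.max())   # samples stay inside [0, 1]
```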
Figure 4.
Illustration of the projected PDF.
This example generalizes to higher dimensions and to arbitrary linear transformations $\mathbf{z} = \mathbf{A}^\top \mathbf{x}$ for a full-rank matrix $\mathbf{A}$. In this case, $p(\mathbf{z}|H_0)$ is no longer Irwin-Hall and in fact is not available in closed form. However, the moment-generating function is available in closed form, so the saddle point approximation may be used (see the Appendix in [13]). Samples of $\mathbf{x}$ are drawn by drawing a sample $\mathbf{z}^*$ from $g(\mathbf{z})$ and then sampling uniformly in the set $\mathcal{M}(\mathbf{z}^*)$. At high dimension, this requires a form of Gibbs sampling called hit-and-run (see p. 2465 in [9]).
4. Advanced Concepts
4.1. Implementation Issues
Implementing (8) seems like a daunting numerical task, since $H_0$ is some canonical distribution for which a real data sample normally lies in the far tails of both $p(\mathbf{x}|H_0)$ and $p(\mathbf{z}|H_0)$. However, if the distributions are known exactly and are represented in the log domain, then the difference
$$\log p(\mathbf{x}|H_0) - \log p(\mathbf{z}|H_0) \qquad (13)$$
typically remains within very reasonable limits. In some cases, terms in $\log p(\mathbf{x}|H_0)$ and $\log p(\mathbf{z}|H_0)$ cancel, leaving (13) only weakly dependent on $\mathbf{x}$ (for example, see Section IV.A, p. 2820 in [8]).
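The cancellation is easy to demonstrate numerically. In the hedged sketch below (the energy feature with a standard Gaussian reference, as in Section 3.1, with arbitrary toy data), both log terms are far outside machine range when exponentiated, yet their $-z/2$ parts cancel exactly, so the difference (13) depends on the data only through $\log z$.

```python
# Hedged sketch: log-domain cancellation in (13) for the energy feature z = ||x||^2.
import numpy as np
from scipy.special import gammaln

N = 1000
rng = np.random.default_rng(1)
x = 3.0 * rng.normal(size=N)         # "real" data, far in the tails of H0 = N(0, I)
z = np.sum(x ** 2)                   # energy statistic; chi-squared with N dof under H0

log_px_H0 = -0.5 * N * np.log(2.0 * np.pi) - 0.5 * z
log_pz_H0 = (0.5 * N - 1.0) * np.log(z) - 0.5 * z - gammaln(0.5 * N) - 0.5 * N * np.log(2.0)

print(log_px_H0, log_pz_H0)          # both are huge negative numbers (order -10^3)
print(np.exp(log_px_H0))             # underflows to 0.0 in double precision

# The -z/2 terms cancel, so (13) depends on x only through log z:
diff = log_px_H0 - log_pz_H0
analytic = (-0.5 * N * np.log(2.0 * np.pi) - (0.5 * N - 1.0) * np.log(z)
            + gammaln(0.5 * N) + 0.5 * N * np.log(2.0))
print(diff, analytic)                # agree to machine precision
```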
Evaluating $p(\mathbf{x}|H_0)$ is mostly trivial since it is normally a canonical distribution, such as Gaussian, exponential, or uniform. Calculating $p(\mathbf{z}|H_0)$, however, remains the primary challenge in maximum entropy PDF projection. When evaluating $p(\mathbf{z}|H_0)$ seems daunting, there are several ways to overcome the problem.
- Saddle Point Approximation. If $p(\mathbf{z}|H_0)$ is not available in closed form, the moment-generating function (MGF) might be tractable. This allows the saddle point approximation (SPA) to be used (see Section III in [11]). Note that the term “approximation” is misleading because the SPA approximates the shape of the MGF on a contour, not the absolute value, so the SPA expression for $p(\mathbf{z}|H_0)$ remains very accurate in the far tails, even when $p(\mathbf{z}|H_0)$ itself cannot be evaluated in machine precision. Examples of this include general linear transformations of exponential and chi-squared random variables (see Section III.C and Section IV in [11]), general linear transformations of uniform random variables (Appendix in [13]), a set of linear-quadratic forms [14], and order statistics [15].
- Floating reference hypothesis. There are conditions under which the MaxEnt reference hypothesis is not unique, so it can depend on a parameter $\theta$, and we write $H_0(\theta)$. An example is when the feature contains the sample mean and sample variance (see the example in Section 3.1). In this case, a Gaussian reference hypothesis can be modified to have any mean and variance and can still serve as the MaxEnt reference hypothesis with no change at all in the resulting projected PDF. In other words, (13) is independent of $\theta$—this can be verified by cancelling terms. Therefore, there is no reason that $\theta$ cannot be made to track the data—that is, let the reference mean and variance equal the sample mean and variance of $\mathbf{x}$. By doing this, $H_0(\theta)$ will track $\mathbf{x}$, allowing simple approximations based on the central limit theorem to be used to approximate $p(\mathbf{z}|H_0(\theta))$.
- Chain Rule. When $p(\mathbf{z}|H_0)$ cannot be derived for a feature transformation, it may be possible to break the feature transformation into stages, where each stage can be easily analyzed. The next section is devoted to this.
4.2. Chain Rule
The primary numerical difficulty in implementing (8) is the calculation of $p(\mathbf{z}|H_0)$. Solutions for many of the most useful feature transformations are available [9,11,12,13]. However, in many real-world applications, such as neural networks, the feature transformation cannot be easily written in a compact form $\mathbf{z} = T(\mathbf{x})$. Instead, it consists of multi-stage transformations, for example, $\mathbf{z}_1 = T_1(\mathbf{x})$, $\mathbf{z}_2 = T_2(\mathbf{z}_1)$, and $\mathbf{z}_3 = T_3(\mathbf{z}_2)$. The individual stages could be the layers of a neural network. In this case, it is best to apply (8) recursively to each stage. This means that the distribution of the first-stage features is written using (6) with $\mathbf{z}_1$ taking the role of input data, and so forth. This results in the chain-rule form:
$$G(\mathbf{x}) = \frac{p(\mathbf{x}|H_1)}{p(\mathbf{z}_1|H_1)} \cdot \frac{p(\mathbf{z}_1|H_2)}{p(\mathbf{z}_2|H_2)} \cdot \frac{p(\mathbf{z}_2|H_3)}{p(\mathbf{z}_3|H_3)} \, g(\mathbf{z}_3), \qquad (14)$$
where $H_1$, $H_2$, $H_3$ are canonical reference hypotheses used at each stage, for example (9), (11), or (12), depending on the ranges of $\mathbf{x}$, $\mathbf{z}_1$, and $\mathbf{z}_2$, respectively.
To understand the importance of the chain rule, consider how we would compute (6) without it. Let $T$ be the combined transformation
$$\mathbf{z} = T(\mathbf{x}) = T_3(T_2(T_1(\mathbf{x}))),$$
and let $p(\mathbf{x}|H_0)$ be one of the canonical reference distributions. Consider the difficulty in deriving $p(\mathbf{z}|H_0)$. At each stage, the distribution of the output feature becomes more and more intractable, and trying to estimate it empirically is futile because a canonical reference distribution is generally completely unrealistic as a PDF for real data. Furthermore, $p(\mathbf{z}|H_0)$ is more often than not evaluated in the far tails of the distribution. With the chain rule, however, we can assume a suitable canonical reference hypothesis at the start of each stage, and only need to derive the feature distribution imposed on the output of that stage.
As long as the reference hypothesis used at each stage meets the stated requirements given in Section 2.1, then the chain as a whole will indeed produce the desired MaxEnt projected PDF, which is the PDF with maximum entropy among all PDFs that generate the desired output feature distribution through the combined transformation [8]!
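The chain rule lends itself to a simple recursive implementation. The sketch below uses my own notation (it is not code from [8]): each stage supplies its transformation, the log reference PDF of its input, and the log PDF that this reference induces on its output; the stage-wise log ratios and the final feature density add up to the projected log-PDF. The toy usage applies an energy feature followed by a 1:1 log transformation, with an arbitrary Gaussian assumed for $g$.

```python
# Hedged sketch of the chain rule (14): accumulate stage-wise log J-functions.
import numpy as np
from scipy.stats import chi2

def chain_log_G(x, stages, log_g):
    """stages: list of (T, log_ref_in, log_ref_out); log_g: log of the final feature density."""
    total = 0.0
    for T, log_ref_in, log_ref_out in stages:
        z = T(x)
        total += log_ref_in(x) - log_ref_out(z)    # log J-function of this stage
        x = z                                      # output becomes input of the next stage
    return total + log_g(x)                        # x now holds the final feature

# Toy two-stage example (arbitrary choices): z1 = ||x||^2, then z2 = log(z1).
N = 16
stage1 = (lambda x: np.array([np.sum(x ** 2)]),
          lambda x: -0.5 * N * np.log(2.0 * np.pi) - 0.5 * np.sum(x ** 2),   # Gaussian ref
          lambda z: chi2.logpdf(z[0], df=N))                                 # induced PDF
stage2 = (lambda z: np.log(z),                                               # 1:1 transform
          lambda z: -z[0],                                                   # Exp(1) reference
          lambda w: w[0] - np.exp(w[0]))                                     # induced PDF of log Exp(1)
log_g = lambda w: -0.5 * np.log(2.0 * np.pi) - 0.5 * (w[0] - 2.0) ** 2       # assumed g: N(2, 1)

x = np.random.default_rng(0).normal(size=N)
print(chain_log_G(x, [stage1, stage2], log_g))
```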
An example of the application of the chain rule is the computation of MEL frequency cepstral coefficients (MFCC), commonly used in speech processing. Let us consider a frame of data of length N, denoted by $\mathbf{x}$. The processing is broken into the following stages:
- The first step, denoted by $\mathbf{z}_1 = T_1(\mathbf{x})$, is to convert $\mathbf{x}$ into magnitude-squared discrete Fourier transform (DFT) bins. Under the standard Gaussian assumption (9), the elements of $\mathbf{z}_1$ are independent and have chi-squared statistics (see Section VI.D.1, pp. 47–48 in [12]).
- The second step is to sum the energy in a set of K MEL-spaced band functions, resulting in a set of K band energies. This can be written as the linear transformation $\mathbf{z}_2 = \mathbf{A}^\top \mathbf{z}_1$, where the columns of the matrix $\mathbf{A}$ are the band functions. This feature transformation is explained in Section 3.2 above, so an exponential reference distribution can be assumed for $\mathbf{z}_1$. Care must be taken that the K band functions add to a constant—this ensures the energy statistic is “contained in the features”.
- The next step is to compute the log of the K band energies, $\mathbf{z}_3 = \log(\mathbf{z}_2)$. This is a 1:1 transformation for which PDF projection simplifies to computing the determinant of the transformation’s Jacobian matrix (see Section VI.A, p. 46 in [12]).
- The last step is the discrete cosine transform (DCT), which can be written as a linear transformation $\mathbf{z}_4 = \mathbf{B}^\top \mathbf{z}_3$. If some DCT coefficients are discarded, then the transformation must be analyzed as in Section 3.1 above by including the energy statistic $\|\mathbf{z}_3\|_2^2$ among the retained features.
This example illustrates that complex feature transformations can be easily analyzed if broken into simple steps. More on the above example can be found in Sections V and VI in [8].
4.3. Large-N Conditional Distributions and Applications
When the feature value $\mathbf{z}$ is fixed, sampling on the manifold $\mathcal{M}(\mathbf{z})$, called UMS, has some interesting interpretations relative to maximum entropy. Let the conditional distribution be written $p(\mathbf{x}|\mathbf{z})$. Notice that $p(\mathbf{x}|\mathbf{z})$ is not a proper distribution since all the probability mass exists on the manifold $\mathcal{M}(\mathbf{z})$ of zero volume. Writing down $p(\mathbf{x}|\mathbf{z})$ in closed form or determining its mean is intractable. It is useful, however, to know $p(\mathbf{x}|\mathbf{z})$ because, for example, its mean is a point estimate of $\mathbf{x}$ based on $\mathbf{z}$, a type of MaxEnt feature inversion. However, depending on the range of $\mathbf{x}$, as exemplified by the three cases in Section 3.1, Section 3.2 and Section 3.3, $p(\mathbf{x}|\mathbf{z})$ can be approximated by a surrogate distribution (see p. 2461 in [9]). The surrogate distribution is a proper distribution that (a) has its probability mass concentrated near $\mathcal{M}(\mathbf{z})$, (b) has a constant value on $\mathcal{M}(\mathbf{z})$, and (c) has its mean value on the manifold. The surrogate distribution therefore meets the same conditions as $p(\mathbf{x}|\mathbf{z})$ but is a proper distribution. The mean of the surrogate distribution, which can be called the centroid of $\mathcal{M}(\mathbf{z})$, is a very close approximation to the mean of $p(\mathbf{x}|\mathbf{z})$, but can be computed. In Section 3.1, Section 3.2 and Section 3.3, the surrogate distribution is Gaussian, exponential, and truncated exponential, respectively. These are the MaxEnt distributions under the applicable constraints. It was shown, for example, when the range of $\mathbf{x}$ is the positive quadrant of $\mathbb{R}^N$, that the centroid corresponds to the classical maximum entropy feature inversion approach for a dimension-reducing linear transformation of intensity data, used for example to sharpen images blurred by a point-spread function [9]. The method, however, is more general because it can be adapted to different ranges of $\mathbf{x}$ [9].
5. Applications
5.1. Classification
Assume there are M classes and the M class hypotheses are $H_1, H_2, \ldots, H_M$. The general form of the classifier, obtained by applying the Bayes theorem and (8), is given by
$$\hat{m} = \arg\max_{m} \; P(H_m) \, \hat{p}(\mathbf{x}|H_m), \qquad (15)$$
where $P(H_m)$ is the prior class probability, and $\hat{p}(\mathbf{x}|H_m)$ is a PDF estimate for class hypothesis $H_m$. For the classification problem, there are many classifier topologies for using (8) to construct $\hat{p}(\mathbf{x}|H_m)$; several are listed below, and a minimal sketch of the general form is given after the list.
- Class-specific features. One can specify a different feature transformation per class, $\mathbf{z}_m = T_m(\mathbf{x})$, but the numerator $p(\mathbf{x}|H_0)$ is common to all classes, so the classifier rule becomes
$$\hat{m} = \arg\max_{m} \; P(H_m) \, \frac{g_m(\mathbf{z}_m)}{p(\mathbf{z}_m|H_0)}.$$
This amounts to just comparing the likelihood ratio between each class hypothesis and the reference distribution, computed using a class-dependent feature [16].
- It is not necessary to use a common reference hypothesis. A class-dependent reference hypothesis can be selected so that the feature $\mathbf{z}_m$ is an approximately sufficient statistic for discriminating the given class from the class-dependent reference hypothesis. Then,
$$\hat{p}(\mathbf{x}|H_m) = \frac{p(\mathbf{x}|H_{0,m})}{p(\mathbf{z}_m|H_{0,m})} \, g_m(\mathbf{z}_m),$$
where $H_{0,m}$ is the class-dependent reference hypothesis. Note that, when using the chain rule (14), there is not a single reference hypothesis associated with each class, but a series of stage-wise reference hypotheses. Note also that here we have relaxed the MaxEnt requirement for the reference hypothesis.
- Using a different feature to test each class hypothesis is not always a good idea. Some data can be “contaminated” with noise or interference, so it may not be suitable to test a hypothesis with just one feature. In this case, a class-specific feature mixture (CSFM) [17,18,19] may be appropriate. For the CSFM, we define a set of feature transformations $T_1, T_2, \ldots, T_M$. (We assume here that the number of feature transformations equals the number of classes, but this is not necessary.) Then, $\hat{p}(\mathbf{x}|H_m)$ is constructed as a mixture density using all the features:
$$\hat{p}(\mathbf{x}|H_m) = \sum_{j=1}^{M} w_{m,j} \, \frac{p(\mathbf{x}|H_{0,j})}{p(\mathbf{z}_j|H_{0,j})} \, g_{m,j}(\mathbf{z}_j),$$
where $H_{0,j}$ is the MaxEnt reference hypothesis corresponding to each feature transformation $T_j$, $\mathbf{z}_j = T_j(\mathbf{x})$, $g_{m,j}(\mathbf{z}_j)$ is the class-m density of feature j, and the $w_{m,j}$ are mixture weights.
- To solve the classification problem (15), it is necessary to obtain a segment of data that can be classified into one of M classes. The problem is often not that simple, and the location of the classifiable “event” may be unknown within a longer data recording, or the data recording may contain multiple events from multiple classes. Using MaxEnt PDF projection, it is possible to solve the data segmentation problem simultaneously with the classification problem [20,21].
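Below is the minimal sketch referred to above (my own structure, not code from the cited works): each class supplies its own feature transformation, reference terms, and learned feature density, and the classifier picks the class that maximizes the log prior plus the projected log-PDF (8).

```python
# Hedged sketch of the general classifier (15) built from projected PDFs.
import numpy as np

def classify(x, classes):
    """classes: list of dicts with keys 'log_prior', 'T', 'log_ref_x', 'log_ref_z', 'log_g'."""
    scores = []
    for c in classes:
        z = c['T'](x)
        log_G = c['log_ref_x'](x) - c['log_ref_z'](z) + c['log_g'](z)   # Equation (8)
        scores.append(c['log_prior'] + log_G)
    return int(np.argmax(scores)), scores
```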
5.2. Other Applications
MaxEnt PDF projection has applications in the analysis of networks and feature transformations. For example, in neural networks, it is possible to view a feed-forward neural network as a generative network, a duality relationship between two opposing types of networks [22]. In addition, the restricted Boltzmann machine (RBM) can be used as a PDF estimator with a tractable distribution [13]. In feature inversion, MaxEnt PDF projection can be used to find MaxEnt point estimates of the input data based on fixed values of the feature [9].
6. Conclusions
In this short paper, the method of maximum entropy PDF projection was presented as a generalization of Jaynes’ maximum entropy principle with moment constraints. The mathematical basis of maximum entropy PDF projection was reviewed and practical considerations and applications were presented.
Funding
This research was funded by Fraunhofer FKIE, Wachtberg, Germany.
Conflicts of Interest
The author declares no conflict of interest.
References
- Jaynes, E.T. Information Theory and Statistical Mechanics I. Phys. Rev. 1957, 106, 620–630. [Google Scholar] [CrossRef]
- Kesavan, H.K.; Kapur, J.N. The Generalized maximum Entropy Principle. IEEE Trans. Syst. Man Cybern. 1989, 19, 1042–1052. [Google Scholar] [CrossRef]
- Banavar, J.R.; Maritan, A.; Volkov, I. Applications of the principle of maximum entropy: From physics to ecology. J. Phys. Condens. Matter 2010, 22, 063101. [Google Scholar] [CrossRef] [PubMed]
- Holmes, D.E. (Ed.) Entropy, Special Issue on Maximum Entropy and Its Application; MDPI: Basel, Switzerland, 2018. [Google Scholar]
- Martino, A.D.; Martino, D.D. An introduction to the maximum entropy approach and its application to inference problems in biology. Heliyon 2018, 4, e00596. [Google Scholar] [CrossRef] [PubMed]
- Picone, J.W. Signal Modeling Techniques in Speech Recognition. Proc. IEEE 1993, 81, 1215–1247. [Google Scholar] [CrossRef]
- Baggenstoss, P.M. Maximum entropy PDF projection: A review. AIP Conf. Proc. 2017. [Google Scholar] [CrossRef]
- Baggenstoss, P.M. Maximum Entropy PDF Design Using Feature Density Constraints: Applications in Signal Processing. IEEE Trans. Signal Process. 2015, 63, 2815–2825. [Google Scholar] [CrossRef]
- Baggenstoss, P.M. Uniform Manifold Sampling (UMS): Sampling the Maximum Entropy PDF. IEEE Trans. Signal Process. 2017, 65, 2455–2470. [Google Scholar] [CrossRef]
- Baggenstoss, P.M. The PDF Projection Theorem and the Class-Specific method. IEEE Trans. Signal Process. 2003, 51, 672–685. [Google Scholar] [CrossRef]
- Kay, S.M.; Nuttall, A.H.; Baggenstoss, P.M. Multidimensional probability density function approximations for detection, classification, and model order selection. IEEE Trans. Signal Process. 2001, 49, 2240–2252. [Google Scholar] [CrossRef]
- Baggenstoss, P.M. The Class-Specific Classifier: Avoiding the Curse of Dimensionality (Tutorial). IEEE Aerosp. Electron. Syst. Mag. Spec. Tutor. Add. 2004, 19, 37–52. [Google Scholar] [CrossRef]
- Baggenstoss, P.M. Evaluating the RBM Without Integration Using PDF Projection. In Proceedings of the EUSIPCO 2017, Kos, Greece, 28 August–2 September 2017. [Google Scholar]
- Nuttall, A.H. Saddlepoint Approximation and First-Order Correction Term to The Joint Probability Density Function of M Quadratic and Linear Forms in K Gaussian Random Variables With Arbitrary Means and Covariances; NUWC Technical Report 11262; US Naval Undersea Warfare Center: Newport, RI, USA, 2000. [Google Scholar]
- Nuttall, A.H. Joint Probability Density Function of Selected Order Statistics And the Sum of the Remaining Random Variables; NUWC Technical Report 11345; US Naval Undersea Warfare Center: Newport, RI, USA, 2002. [Google Scholar]
- Baggenstoss, P.M. Class-Specific Features in Classification. IEEE Trans. Signal Process. 1999, 47, 3428–3432. [Google Scholar] [CrossRef]
- Baggenstoss, P.M. Optimal Detection and Classification of Diverse Short-Duration Signals. In Proceedings of the International Conference on Cloud Engineering, Boston, MA, USA, 11–14 March 2014; pp. 534–539. [Google Scholar]
- Baggenstoss, P.M. Class-specific model mixtures for the classification of time-series. In Proceedings of the 2015 23rd European Signal Processing Conference (EUSIPCO), Nice, France, 31 August–4 September 2015. [Google Scholar]
- Baggenstoss, P.M. Class-Specific Model Mixtures for the Classification of Acoustic Time-Series. IEEE Trans. AES 2016, 52, 1937–1952. [Google Scholar] [CrossRef]
- Baggenstoss, P.M. A multi-resolution hidden Markov model using class-specific features. IEEE Trans. Signal Process. 2010, 58, 5165–5177. [Google Scholar] [CrossRef]
- Baggenstoss, P.M. Acoustic Event Classification using Multi-resolution HMM. In Proceedings of the European Signal Processing Conference (EUSIPCO) 2018, Rome, Italy, 3–7 September 2018. [Google Scholar]
- Baggenstoss, P.M. On the Duality Between Belief Networks and Feed-Forward Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2018, 1–11. [Google Scholar] [CrossRef] [PubMed]
© 2018 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).