Special Issue "Information Theory in Machine Learning and Data Science"

A special issue of Entropy (ISSN 1099-4300). This special issue belongs to the section "Information Theory".

Deadline for manuscript submissions: closed (31 August 2017)

Special Issue Editor

Guest Editor
Prof. Dr. Maxim Raginsky

University of Illinois, Department of Electrical and Computer Engineering, 162 Coordinated Science Lab MC 228, 1308 W. Main St., Urbana, IL 61801, USA
Interests: information theory; statistical machine learning; optimization and control

Special Issue Information

Dear Colleagues,

The field of machine learning and data science is concerned with the design and analysis of algorithms for making decisions, reasoning about the world, and extracting knowledge from massive amounts of data. On the one hand, the performance of machine learning algorithms is limited by the amount of predictively relevant information contained in the data. On the other hand, different procedures for accessing and processing data can be more or less informative. Claude Shannon originally developed information theory with communication systems in mind. However, it has also proved to be an indispensable analytical tool in the field of mathematical statistics, where it is used to quantify the fundamental limits on the performance of statistical decision procedures and to guide the processes of feature selection and experimental design.

The purpose of this Special Issue is to highlight the state-of-the-art in applications of information theory to the fields of machine learning and data science. Possible topics include, but are not limited to, the following:

  • Fundamental information-theoretic limits of machine learning algorithms
  • Information-directed sampling and optimization
  • Statistical estimation, optimization, and learning under information constraints
  • Information bottleneck methods
  • Information-theoretic approaches to adaptive data analysis
  • Information-theoretic approaches to feature design and selection
  • Estimation of information-theoretic functionals

Prof. Dr. Maxim Raginsky
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All papers will be peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Entropy is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1500 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Published Papers (9 papers)


Research


Open Access Article: Survey on Probabilistic Models of Low-Rank Matrix Factorizations
Entropy 2017, 19(8), 424; doi:10.3390/e19080424
Received: 12 June 2017 / Revised: 11 August 2017 / Accepted: 16 August 2017 / Published: 19 August 2017
Abstract
Low-rank matrix factorizations such as Principal Component Analysis (PCA), Singular Value Decomposition (SVD) and Non-negative Matrix Factorization (NMF) are a large class of methods for pursuing the low-rank approximation of a given data matrix. Conventional factorization models assume that the data matrices are contaminated stochastically by some type of noise, so point estimates of the low-rank components can be obtained by Maximum Likelihood (ML) or Maximum a Posteriori (MAP) estimation. In the past decade, a variety of probabilistic models of low-rank matrix factorizations have emerged. The most significant difference between low-rank matrix factorizations and their corresponding probabilistic models is that the latter treat the low-rank components as random variables. This paper surveys the probabilistic models of low-rank matrix factorizations. First, we review probability distributions commonly used in these models and introduce their conjugate priors to simplify Bayesian inference. We then present two main inference methods for probabilistic low-rank matrix factorizations, namely Gibbs sampling and variational Bayesian inference. Next, we group the important probabilistic models into several categories, organized by matrix factorization formulation (mainly PCA, matrix factorization, robust PCA, NMF and tensor factorization), and review each in turn. Finally, we discuss open research issues for future study.
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)
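As an illustration of the kind of model surveyed here, the sketch below shows MAP estimation for a simple probabilistic matrix factorization X ≈ UVᵀ with Gaussian noise and zero-mean Gaussian priors on the factors, which reduces to alternating ridge regression. The function name and parameter values are illustrative and not taken from the paper.

```python
# A minimal sketch (not the survey's code) of MAP estimation for probabilistic
# matrix factorization: X ~ U V^T + Gaussian noise, with zero-mean Gaussian
# priors on the rows of U and V. The MAP problem reduces to alternating ridge
# regression with ridge strength noise_var / prior_var.
import numpy as np

def pmf_map(X, rank=5, noise_var=1.0, prior_var=10.0, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.normal(scale=0.1, size=(m, rank))
    V = rng.normal(scale=0.1, size=(n, rank))
    lam = noise_var / prior_var          # ridge strength implied by the prior
    I = lam * np.eye(rank)
    for _ in range(n_iters):
        # Maximize the posterior over U with V fixed (ridge regression per row),
        # then over V with U fixed.
        U = X @ V @ np.linalg.inv(V.T @ V + I)
        V = X.T @ U @ np.linalg.inv(U.T @ U + I)
    return U, V

# Toy usage: recover a noisy rank-3 matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 40)) + 0.1 * rng.normal(size=(100, 40))
U, V = pmf_map(X, rank=3)
print("relative reconstruction error:", np.linalg.norm(X - U @ V.T) / np.linalg.norm(X))
```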

Open Access Article: Estimating Mixture Entropy with Pairwise Distances
Entropy 2017, 19(7), 361; doi:10.3390/e19070361
Received: 8 June 2017 / Revised: 8 July 2017 / Accepted: 12 July 2017 / Published: 14 July 2017
Abstract
Mixture distributions arise in many parametric and non-parametric settings, for example in Gaussian mixture models and in non-parametric estimation. It is often necessary to compute the entropy of a mixture, but, in most cases, this quantity has no closed-form expression, making some form of approximation necessary. We propose a family of estimators based on a pairwise distance function between mixture components, and show that this estimator class has many attractive properties. For many distributions of interest, the proposed estimators are efficient to compute, differentiable in the mixture parameters, and become exact when the mixture components are clustered. We prove that this family includes lower and upper bounds on the mixture entropy. The Chernoff α-divergence gives a lower bound when chosen as the distance function, with the Bhattacharyya distance providing the tightest lower bound for components that are symmetric and members of a location family. The Kullback–Leibler divergence gives an upper bound when used as the distance function. We provide closed-form expressions of these bounds for mixtures of Gaussians and discuss their applications to the estimation of mutual information. Using numerical simulations, we then demonstrate that our bounds are significantly tighter than well-known existing bounds. This estimator class is very useful in optimization problems involving maximization/minimization of entropy and mutual information, such as MaxEnt and rate-distortion problems.
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)
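To make the estimator family concrete, the sketch below evaluates the pairwise-distance bounds for a Gaussian mixture with isotropic components sharing a variance s2; in that special case the KL and Bhattacharyya distances have the simple closed forms used in the code. This is an illustrative implementation, not the authors' code.

```python
# A small sketch of the pairwise-distance entropy estimator for a Gaussian
# mixture with isotropic components N(mu_i, s2*I) and weights w_i. Using the
# KL divergence as the pairwise distance gives an upper bound on the mixture
# entropy; using the Bhattacharyya distance gives a lower bound.
import numpy as np

def pairwise_bounds(mus, w, s2):
    mus, w = np.asarray(mus, float), np.asarray(w, float)
    d = mus.shape[1]
    sq = ((mus[:, None, :] - mus[None, :, :]) ** 2).sum(-1)  # ||mu_i - mu_j||^2
    kl = sq / (2 * s2)           # KL(N_i || N_j) for equal isotropic covariances
    bhatt = sq / (8 * s2)        # Bhattacharyya distance for the same case
    h_comp = 0.5 * d * np.log(2 * np.pi * np.e * s2)  # entropy of each component

    def estimate(dist):
        # H_hat = sum_i w_i H(p_i) - sum_i w_i log sum_j w_j exp(-D(p_i, p_j))
        return np.sum(w * h_comp) - np.sum(w * np.log((w * np.exp(-dist)).sum(axis=1)))

    return estimate(bhatt), estimate(kl)   # (lower bound, upper bound)

lo, hi = pairwise_bounds(mus=[[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]],
                         w=[0.5, 0.25, 0.25], s2=1.0)
print(f"{lo:.3f} <= H(mixture) <= {hi:.3f}")
```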

Open Access Article: Overfitting Reduction of Text Classification Based on AdaBELM
Entropy 2017, 19(7), 330; doi:10.3390/e19070330
Received: 2 May 2017 / Revised: 26 June 2017 / Accepted: 29 June 2017 / Published: 6 July 2017
Abstract
Overfitting is an important problem in machine learning. Several algorithms, such as the extreme learning machine (ELM), suffer from this issue when facing high-dimensional sparse data, e.g., in text classification. One common issue is that the extent of overfitting is not well quantified. In this paper, we propose a quantitative measure of overfitting, referred to as the rate of overfitting (RO), and a novel model, named AdaBELM, to reduce overfitting. With RO, the overfitting problem can be quantitatively measured and identified. The newly proposed model achieves high performance on multi-class text classification. To evaluate the generalizability of the new model, we designed experiments based on three datasets, i.e., the 20 Newsgroups, Reuters-21578, and BioMed corpora, which represent balanced, unbalanced, and real application data, respectively. Experimental results demonstrate that AdaBELM reduces overfitting and outperforms the classical ELM, decision trees, random forests, and AdaBoost on all three text-classification datasets; for example, it achieves 62.2% higher accuracy than ELM. The proposed model therefore has good generalizability.
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)
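For readers unfamiliar with the base learner, the sketch below is a bare-bones ELM classifier (a fixed random hidden layer plus a ridge-regularized least-squares readout). The boosting wrapper and the RO measure proposed in the paper are not reproduced here; the class name and hyperparameters are illustrative.

```python
# A minimal sketch of an extreme learning machine (ELM) classifier, the base
# model referenced in the abstract. (AdaBELM additionally wraps boosting around
# such base learners; that part is intentionally omitted.)
import numpy as np

class ELMClassifier:
    def __init__(self, n_hidden=200, reg=1e-2, seed=0):
        self.n_hidden, self.reg, self.rng = n_hidden, reg, np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)   # fixed random nonlinear features

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        Y = (y[:, None] == self.classes_[None, :]).astype(float)  # one-hot targets
        # Ridge solution for the output weights only; hidden weights stay random.
        self.beta = np.linalg.solve(H.T @ H + self.reg * np.eye(self.n_hidden), H.T @ Y)
        return self

    def predict(self, X):
        return self.classes_[np.argmax(self._hidden(X) @ self.beta, axis=1)]
```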

Open Access Article: Rate-Distortion Bounds for Kernel-Based Distortion Measures
Entropy 2017, 19(7), 336; doi:10.3390/e19070336
Received: 9 May 2017 / Revised: 16 June 2017 / Accepted: 2 July 2017 / Published: 5 July 2017
Abstract
Kernel methods have been used for turning linear learning algorithms into nonlinear ones. These nonlinear algorithms measure distances between data points by the distance in the kernel-induced feature space. In lossy data compression, the optimal tradeoff between the number of quantized points and the incurred distortion is characterized by the rate-distortion function. However, the rate-distortion functions associated with distortion measures involving kernel feature mapping have yet to be analyzed. We consider two reconstruction schemes, reconstruction in input space and reconstruction in feature space, and provide bounds on the rate-distortion functions for these schemes. Comparison of the derived bounds to the quantizer performance obtained by the kernel K-means method suggests that the rate-distortion bounds for input-space and feature-space reconstructions are informative at low and high distortion levels, respectively.
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)
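The sketch below illustrates the feature-space distortion measure at issue: with a kernel k, squared distances to a cluster's feature-space mean follow from the kernel trick, so the distortion of a kernel-K-means style quantizer can be evaluated without ever computing the feature map explicitly. The RBF kernel and function names are illustrative choices, not taken from the paper.

```python
# Feature-space distortion via the kernel trick:
# ||phi(x_i) - m_c||^2 = K_ii - (2/|C|) sum_j K_ij + (1/|C|^2) sum_{j,l} K_jl,
# where m_c is the feature-space mean of the cluster containing x_i.
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def feature_space_distortion(X, labels, gamma=1.0):
    """Average squared feature-space distance of each point to its cluster mean."""
    K = rbf_kernel(X, X, gamma)
    total = 0.0
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        Kcc = K[np.ix_(idx, idx)]
        total += (np.diag(Kcc) - 2 * Kcc.mean(axis=1) + Kcc.mean()).sum()
    return total / len(X)

# Usage: pass any clustering (e.g., kernel k-means labels) of the data X.
```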

Open Access Article: The Expected Missing Mass under an Entropy Constraint
Entropy 2017, 19(7), 315; doi:10.3390/e19070315
Received: 7 June 2017 / Revised: 20 June 2017 / Accepted: 26 June 2017 / Published: 29 June 2017
Abstract
In Berend and Kontorovich (2012), the following problem was studied: A random sample of size t is taken from a world (i.e., probability space) of size n; bound the expected value of the probability of the set of elements not appearing in the sample (unseen mass) in terms of t and n. Here we study the same problem, where the world may be countably infinite, and the probability measure on it is restricted to have an entropy of at most h. We provide tight bounds on the maximum of the expected unseen mass, along with a characterization of the measures attaining this maximum.
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)
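For concreteness, the sketch below computes the expected missing (unseen) mass of a known distribution after t draws, E[M_t] = Σ_i p_i (1 − p_i)^t, together with its entropy, the quantity constrained in the paper. The toy distribution is illustrative only.

```python
# Expected missing mass of a known distribution p after t i.i.d. draws:
# each element i is unseen with probability (1 - p_i)^t, so
#   E[M_t] = sum_i p_i * (1 - p_i)^t.
import numpy as np

def expected_missing_mass(p, t):
    p = np.asarray(p, float)
    return float(np.sum(p * (1.0 - p) ** t))

def entropy_nats(p):
    p = np.asarray(p, float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# Toy example: a geometric-like distribution over a large alphabet.
p = 0.5 ** np.arange(1, 40)
p /= p.sum()
print("entropy (nats):", round(entropy_nats(p), 3))
for t in (10, 100, 1000):
    print("t =", t, "expected missing mass:", round(expected_missing_mass(p, t), 4))
```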

Open Access Article: A Novel Geometric Dictionary Construction Approach for Sparse Representation Based Image Fusion
Entropy 2017, 19(7), 306; doi:10.3390/e19070306
Received: 17 April 2017 / Revised: 21 June 2017 / Accepted: 26 June 2017 / Published: 27 June 2017
Cited by 3
Abstract
Sparse-representation based approaches have been integrated into image fusion methods in the past few years and show strong performance in image fusion. Training an informative and compact dictionary is a key step for a sparsity-based image fusion method; however, it is difficult to balance “informative” and “compact”. In order to obtain sufficient information for sparse representation in dictionary construction, this paper classifies image patches from the source images into different groups based on morphological similarities. Stochastic coordinate coding (SCC) is used to extract the corresponding image-patch information for dictionary construction. Using the constructed dictionary, image patches of the source images are converted to sparse coefficients by the simultaneous orthogonal matching pursuit (SOMP) algorithm. Finally, the sparse coefficients are fused by the Max-L1 fusion rule and inverted to obtain the fused image. Comparison experiments evaluate the fused image in terms of image features, information, structural similarity, and visual perception. The results confirm the feasibility and effectiveness of the proposed image fusion solution.
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)
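The sketch below illustrates the Max-L1 fusion step described above: co-located patches from the two source images are sparse-coded over a shared dictionary, and the coefficient vector with the larger L1 norm is kept. Plain per-patch OMP stands in for SOMP here, and the function signature is illustrative.

```python
# A compact sketch of the Max-L1 fusion rule: sparse-code each pair of
# co-located patches over a shared dictionary D, keep the coefficient vector
# with the larger L1 norm, and reconstruct the fused patch as D @ coefficients.
import numpy as np
from sklearn.linear_model import orthogonal_mp

def fuse_patches(patches_a, patches_b, D, n_nonzero=8):
    """patches_*: (n_patches, patch_dim) arrays; D: (patch_dim, n_atoms)
    dictionary whose columns are assumed to be unit-norm atoms."""
    fused = np.empty(patches_a.shape, dtype=float)
    for i, (pa, pb) in enumerate(zip(patches_a, patches_b)):
        ca = orthogonal_mp(D, pa, n_nonzero_coefs=n_nonzero)  # per-patch OMP
        cb = orthogonal_mp(D, pb, n_nonzero_coefs=n_nonzero)  # (stand-in for SOMP)
        chosen = ca if np.abs(ca).sum() >= np.abs(cb).sum() else cb  # Max-L1 rule
        fused[i] = D @ chosen
    return fused
```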

Open Access Article: Face Verification with Multi-Task and Multi-Scale Feature Fusion
Entropy 2017, 19(5), 228; doi:10.3390/e19050228
Received: 18 March 2017 / Revised: 5 May 2017 / Accepted: 13 May 2017 / Published: 17 May 2017
Abstract
Face verification for unrestricted faces in the wild is a challenging task. This paper proposes a method based on two deep convolutional neural networks (CNNs) for face verification. In this work, we explore using identification signals to supervise one CNN and the combination of semi-verification and identification to train the other. In order to estimate the semi-verification loss at a low computational cost, a circle composed of all faces is used for selecting face pairs from pairwise samples. In the process of face normalization, we propose using different landmarks of faces to solve the problems caused by poses. In addition, the final face representation is formed by concatenating the features of the two deep CNNs after principal component analysis (PCA) reduction. Furthermore, each feature combines multi-scale representations by making use of auxiliary classifiers. For the final verification, we adopt only the face representation of one region and one resolution of a face, combined with a Joint Bayesian classifier. Experiments show that our method can extract effective face representations with a small training dataset, and our algorithm achieves 99.71% verification accuracy on the Labeled Faces in the Wild (LFW) dataset.
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)
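As a rough illustration of the feature-fusion step only, the sketch below PCA-reduces and concatenates features from two networks and scores a pair with cosine similarity as a simple stand-in for the Joint Bayesian classifier; all names, dimensions, and the scoring rule are placeholders, not the paper's pipeline.

```python
# A rough sketch of feature fusion for verification: PCA-reduce the features
# from each of the two CNNs, concatenate, L2-normalize, and score pairs.
import numpy as np
from sklearn.decomposition import PCA

def build_representation(feats_cnn1, feats_cnn2, dim=128):
    """feats_cnn*: (n_faces, d_i) feature matrices from the two networks."""
    reduced = [PCA(n_components=dim).fit_transform(f) for f in (feats_cnn1, feats_cnn2)]
    rep = np.hstack(reduced)
    return rep / np.linalg.norm(rep, axis=1, keepdims=True)   # L2-normalize rows

def verify(rep_a, rep_b, threshold=0.5):
    """Declare 'same person' when the cosine similarity exceeds the threshold."""
    return float(rep_a @ rep_b) > threshold
```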

Open Access Article: Consistent Estimation of Partition Markov Models
Entropy 2017, 19(4), 160; doi:10.3390/e19040160
Received: 1 March 2017 / Revised: 31 March 2017 / Accepted: 4 April 2017 / Published: 6 April 2017
Cited by 1
Abstract
The Partition Markov Model characterizes the process by a partition L of the state space, where the elements in each part of L share the same transition probability to any given element of the alphabet. This model aims to answer two questions: what is the minimal number of parameters needed to specify a Markov chain, and how can these parameters be estimated? To answer them, we build a consistent model-selection strategy: given a realization of size n of the process, it finds a model within the Partition Markov class with a minimal number of parts that represents the law of the process. From this strategy we derive a measure that establishes a metric on the state space. In addition, we show that if the law of the process is Markovian, then, as n goes to infinity, L is eventually retrieved. We present an application to modeling internet navigation patterns.
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)
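The sketch below conveys the structure of a partition of the state space in a deliberately simplified way: states whose empirical transition rows are close (in total variation) are merged into one part and share parameters. The paper's consistent model-selection criterion is not reproduced; the threshold rule here is only illustrative.

```python
# Illustrative (simplified) grouping of states into parts that share a
# transition distribution, estimated from a single observed sequence.
import numpy as np

def empirical_transitions(seq, alphabet):
    idx = {a: i for i, a in enumerate(alphabet)}
    counts = np.zeros((len(alphabet), len(alphabet)))
    for a, b in zip(seq[:-1], seq[1:]):
        counts[idx[a], idx[b]] += 1
    rows = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(rows, 1)   # empirical P(next | current)

def partition_states(P, tol=0.1):
    """Greedily merge states whose transition rows differ by < tol in total variation."""
    parts = []
    for s, row in enumerate(P):
        for part in parts:
            if 0.5 * np.abs(row - P[part[0]]).sum() < tol:
                part.append(s)
                break
        else:
            parts.append([s])
    return parts
```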

Other


Open Access Letter: Discovery of Kolmogorov Scaling in the Natural Language
Entropy 2017, 19(5), 198; doi:10.3390/e19050198
Received: 16 February 2017 / Revised: 25 April 2017 / Accepted: 26 April 2017 / Published: 2 May 2017
Abstract
We consider the rate R and variance σ² of Shannon information in snippets of text based on word frequencies in the natural language. We empirically identify Kolmogorov’s scaling law σ² ∝ k^(−1.66 ± 0.12) (95% c.l.) as a function of k = 1/N, measured by word count N. This result highlights a potential association of information flow in snippets, analogous to energy cascade in turbulent eddies in fluids at high Reynolds numbers. We propose R and σ² as robust utility functions for objective ranking of concordances in efficient search for maximal information seamlessly across different languages and as a starting point for artificial attention.
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)
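A hedged sketch of the kind of measurement described above: per-word Shannon information is computed from corpus word frequencies, the text is cut into snippets of N words, and the scaling of the variance of the per-snippet information rate with k = 1/N is estimated by a log-log fit. Tokenization and corpus handling are placeholders, not the authors' procedure.

```python
# Estimate the variance of snippet information rates and its scaling with
# k = 1/N, where N is the snippet length in words.
import numpy as np
from collections import Counter

def snippet_information_variance(words, N):
    freq = Counter(words)
    total = sum(freq.values())
    info = np.array([-np.log(freq[w] / total) for w in words])   # nats per word
    n_snippets = len(words) // N
    rates = info[: n_snippets * N].reshape(n_snippets, N).mean(axis=1)
    return rates.var()

def fit_scaling_exponent(words, snippet_sizes=(16, 32, 64, 128, 256)):
    k = 1.0 / np.array(snippet_sizes)
    var = np.array([snippet_information_variance(words, N) for N in snippet_sizes])
    slope, _ = np.polyfit(np.log(k), np.log(var), 1)   # sigma^2 ~ k**slope
    return slope
```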

Journal Contact

MDPI AG
Entropy Editorial Office
St. Alban-Anlage 66, 4052 Basel, Switzerland
E-Mail: 
Tel. +41 61 683 77 34
Fax: +41 61 302 89 18