# Information Bottleneck Classification in Extremely Distributed Systems


## Abstract


## 1. Introduction

**The challenges of decentralized classification.** There is a gap between the results of centralized and decentralized classification, even for simple datasets such as MNIST [15]: centralized models achieve very high recognition accuracy, while performance in the decentralized setting is quite modest. Studies such as [16,17] showed that semi-supervised classification is an even more challenging task for such systems. Anomaly detection and one-class classification are two subfamilies of unsupervised learning closely related to decentralized classification. Well-known methods such as one-class support vector machines [18] have been proposed, but in practice they suffer from very slow training and limited performance. To the best of our knowledge, all recent advances in anomaly detection use generative models composed of an encoder and a decoder, either with adversarially learned denoising to better separate inlier and outlier images [19], by adversarially learning a disentangled implicit representation [20], or by constraining the latent representation to generate only possible examples from the class and to avoid generating any example outside the class, no matter how close it is to the class [21]. These studies state, without theoretical explanation, that their models seek to project and compress the data distribution of a class in an optimal way, keeping only the information necessary to identify the class in question while being able to regenerate the initial data from the compressed version with minimal error. However, the most recent studies clearly demonstrate the critical limitations of representations trained on single-class data. As shown in [21], if the model is not sharpened to reject outlier samples, it can learn more generic information than is strictly necessary for a given class, in which case it is unable to isolate that class from the others. This is, for instance, the case in [21], where the one-class model was trained on class 8 and yet treats samples of the outlier classes 1, 5, 6 and 9 as inliers of class 8.

## 2. Problem Formulation: The One Node–One Class Setup

## 3. Related Work

**An information bottleneck interpretation.** We use the Information Bottleneck (IB) principle presented in [6] to build the theory behind centralized and decentralized classification models. The analysis of the supervised and unsupervised information bottleneck problems was performed in [23] and generalized to the distributed setup in [24]. In this work, we extend the IB to demonstrate the importance of compression in the form of vector quantization for the classification problem. Moreover, we show that classical centralized training is a supervised IB scenario, whereas decentralized training is an instance of the unsupervised IB scenario developed in [7] and summarized in Figure 1. Ideally, each node should: (a) store in its encoded parameters the in-class data information, so that the distribution of one class is distinct from the others; (b) be trained to compress and decompress optimally for in-class data, such that the reconstruction error is minimized (blue rate-distortion curve of the matched case in Figure 4), and sub-optimally for out-of-class data, such that the reconstruction error is not minimal (orange rate-distortion curve of the mismatched case in the same figure); and (c) have a rate of compression (${R}_{Q}$ in Figure 4) that separates the optimal node from the sub-optimal ones. Shannon’s rate-distortion theory assumes that the compression-decompression model used for data compression is jointly trained to match the input data statistics. This makes a link to the optimal matched signal detection used in signal processing theory: each class has its own representative manifold and a corresponding filter represented by its own encoder–decoder pair. The main difference is that a matched filter is designed for one particular signal and therefore detects the closeness of the probe to that signal, whereas in our framework we validate the proximity of the probe to the entire class manifold represented by the ensemble of training data. This is not done by measuring the proximity to each available in-class training data point and aggregating the results, but by the trained model itself, which ensures the continuity of the learned data manifold through the considered encoder–decoder system as a whole. It is important to note that compression is not required for such learning. Instead, compression is needed to distinguish in-class from out-of-class probes by producing a higher reconstruction error for the out-of-class samples, as shown in the mismatched case plot of Figure 4.

**Big-data and privacy-preserving classification.** In the considered setup, the notion of privacy concerns the training datasets, which are kept locally at each node. No data sharing or model parameter sharing is required, either between the local nodes or with the centralized server. Therefore, the training stage is considered to be privacy-preserving. In contrast, we assume that the probe distributed by the centralized node for classification is not privacy-sensitive at the classification stage, so no special measures are taken to preserve its privacy. One could nevertheless apply special obfuscation strategies to protect the probe, such as randomization of specific dimensions in the embedded space; we refer the interested reader to [25] for an overview of such techniques.

**Novelty and contribution:**

- We propose a fully distributed learning framework without any gradient communication to the centralized node, in contrast to distributed systems based on Federated Learning (FL). This resolves many common issues of FL pointed out in [11,26], related to the communication burden at the training stage and the need for gradient obfuscation for privacy reasons.
- We consider a new problem formulation of decentralized learning, where each node has access only to the samples of a single class. No communication between the nodes is assumed. We call this extreme case of non-IID Federated Learning the One Node–One Class (ON-OC) setup.
- We propose a theoretical model behind the proposed decentralized system based on the information bottleneck principle and justify the role of lossy feature compression as an important part of the information bottleneck implementation for the considered ON-OC classification.
- In contrast to centralized classification systems and distributed Federated Learning, which both learn decision boundaries between classes from simultaneously available training samples of all classes, we propose a novel approach that learns the data manifold of each individual class at the local nodes and makes the decision at the centralized node based on the proximity of a probe to each data manifold.
- The manifold learning is also accomplished in a new way, using a system similar to an auto-encoder architecture [27] but keeping the encoder fixed for all classes. Thus, the only learnable parts of each node are the compressor and the decoder. This reduces the training complexity and gives flexibility in the design of compression strategies. Additionally, by choosing an encoder based on a geometrically invariant network, namely ScatNet [28], one can hope that the amount of training data needed to cope with the geometrical variability in the training data is reduced, as suggested by the authors of [28].
- Finally, the proposed approach also differs from our previous framework [29] in the following ways:
  - The framework in [29] was not based on the IB principle, while the current work explicitly extends the IB framework.
  - The previous work [29] did not use compression in the latent space, while the current work uses explicit compression in the form of vector quantization. The use of quantization is an important element of the IB framework in the considered ON-OC setup. In this work, we show that classification results with properly selected compression are considerably improved with respect to the unquantized latent space considered in our prior work [29].
  - The framework in [29] was based on the concept of the Variational Auto-Encoder (VAE), which requires training both the encoder and the decoder. This requires a sufficient amount of data to make the encoder invariant to the different types of geometrical deviations. In contrast, the current work uses a geometrically invariant transform, in particular ScatNet, which is designed to be invariant to geometrical deviations. This makes it possible, first, to avoid training the encoder and, second, to train the system without a large amount of labeled data and without observing data from all classes.
  - In a VAE-based system, the latent space is difficult to interpret when selecting the dimensions to quantize. When ScatNet is used as the encoder, the latent space is well interpretable, and its different sub-bands correspond to different frequencies. It therefore becomes evident which sub-bands should be preserved and which can be suppressed, depending on the problem being solved.
  - Finally, this new setup shows higher classification accuracy for the ON-OC setup.

## 4. Theoretical Model

#### 4.1. Information Bottleneck Concept of Centralized Systems

- A minimization of ${H}_{\mathit{\varphi}}(\mathbf{Z})$, such that $\mathbf{Z}$ contains as little information as possible about $\mathbf{X}$ for compression purposes; therefore, one has to compress at the encoding $\mathbf{X}\stackrel{{q}_{\mathit{\varphi}}(\mathbf{z}|\mathbf{x})}{\to}\mathbf{Z}$. In general, this compressing encoding is learned by optimizing $\mathit{\varphi}$. We simplify the learning process by using a deterministic compression map $\mathbf{Z}={Q}_{\mathit{\varphi}}\left({f}_{\mathit{\varphi}}\left(\mathbf{X}\right)\right)$, where ${f}_{\mathit{\varphi}}(\cdot)$ is a feature extractor and ${Q}_{\mathit{\varphi}}(\cdot)$ is a vector quantizer. Accordingly, the rate ${R}_{Q}={H}_{\mathit{\varphi}}(\mathbf{Z})\le \log_{2}K$ is determined by the number of centroids $K$ in the considered vector quantizer, with equality if and only if all centroids are equiprobable.
- A maximization of ${H}_{\mathit{\varphi}}(\mathbf{Z}|\mathbf{X})$, which vanishes under the deterministic encoding $\mathbf{Z}={Q}_{\mathit{\varphi}}\left({f}_{\mathit{\varphi}}\left(\mathbf{X}\right)\right)$: since $\mathbf{Z}$ is a function of $\mathbf{X}$, we have ${H}_{\mathit{\varphi}}(\mathbf{Z}|\mathbf{X})=0$ in Equation (3).
- A minimization of ${H}_{\mathit{\varphi},\mathit{\theta}}(\mathbf{M}|\mathbf{Z})$, which represents the cross-entropy between the distribution of the true labels $p(\mathbf{m})$ and the estimated ones ${p}_{\mathit{\theta}}(\mathbf{m}|\mathbf{z})$:$${H}_{\mathit{\varphi},\mathit{\theta}}(\mathbf{M}|\mathbf{Z})=-{\mathbb{E}}_{p(\mathbf{x},\mathbf{m})}\left[{\mathbb{E}}_{{q}_{\mathit{\varphi}}(\mathbf{z}|\mathbf{x})}\left[\log_{2}{p}_{\mathit{\theta}}(\mathbf{m}|\mathbf{z})\right]\right].$$
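The bound ${R}_{Q}\le \log_{2}K$ in the first item can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function name `empirical_rate_bits` and the toy index sequences are assumptions introduced only for illustration. It estimates ${H}_{\mathit{\varphi}}(\mathbf{Z})$ from the empirical frequencies of the centroid indices produced by a vector quantizer.

```python
import numpy as np

def empirical_rate_bits(assignments, num_centroids):
    """Empirical entropy H(Z), in bits, of centroid indices in {0, ..., K-1}."""
    counts = np.bincount(np.asarray(assignments), minlength=num_centroids)
    probs = counts / counts.sum()
    nonzero = probs[probs > 0]  # convention: 0 * log2(0) = 0
    return float(-(nonzero * np.log2(nonzero)).sum())

K = 4  # hypothetical number of centroids
# Toy index sequences (illustrative only, not data from the paper).
uniform = np.array([0, 1, 2, 3] * 25)                         # equiprobable centroids
skewed = np.array([0] * 70 + [1] * 10 + [2] * 10 + [3] * 10)  # one dominant centroid

rate_uniform = empirical_rate_bits(uniform, K)
rate_skewed = empirical_rate_bits(skewed, K)
print(rate_uniform, rate_skewed, np.log2(K))
```

With equiprobable centroids the estimated rate reaches $\log_{2}K$, while any non-uniform usage of the centroids gives a strictly smaller rate.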

#### 4.2. Information Bottleneck Concept of Decentralized Systems

## 5. Implementation Details

#### 5.1. Training of Local Encoders

#### 5.1.1. Structure of the Scattering Transform

#### 5.1.2. Training of Local Quantizers

#### 5.2. Training of Local Decoders

#### 5.3. Central Classification Procedure

- the Manhattan distance ${d}_{{\ell}_{1}}$,
- the perceptual distance ${d}_{VGG}$ defined in [37],
- the pseudo-distance ${d}_{t}$, which counts the number of pixels whose absolute error is greater than or equal to a threshold $t$:$${d}_{t}(\widehat{\mathbf{x}},\mathbf{x})=\sum_{i=1}^{{N}_{\mathbf{x}}}{\mathbb{1}}_{\left|\widehat{\mathbf{x}}[i]-\mathbf{x}[i]\right|\ge t},\quad \mathrm{where}\quad {\mathbb{1}}_{\left|\widehat{\mathbf{x}}[i]-\mathbf{x}[i]\right|\ge t}=\begin{cases}1, & \mathrm{if}\ \left|\widehat{\mathbf{x}}[i]-\mathbf{x}[i]\right|\ge t,\\ 0, & \mathrm{otherwise}.\end{cases}$$
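As an illustration of the two distances that do not need a pretrained network, the sketch below computes ${d}_{{\ell}_{1}}$ and ${d}_{t}$ on toy vectors. The decision rule shown, picking the node with the smallest reconstruction error, is an assumption about how the central node uses these distances; the helper names `d_l1`, `d_t` and `classify` and the toy data are hypothetical.

```python
import numpy as np

def d_l1(x_hat, x):
    """Manhattan distance between a reconstruction and the probe."""
    return float(np.abs(x_hat - x).sum())

def d_t(x_hat, x, t):
    """Pseudo-distance: number of pixels with absolute error >= t."""
    return int((np.abs(x_hat - x) >= t).sum())

def classify(probe, reconstructions, t=0.4):
    """Assumed central rule: return the index of the node with the smallest d_t."""
    errors = [d_t(x_hat, probe, t) for x_hat in reconstructions]
    return int(np.argmin(errors)), errors

# Toy probe and per-node reconstructions (illustrative values only).
probe = np.array([0.0, 1.0, 0.5, 0.0])
reconstructions = [
    np.array([0.0, 0.9, 0.5, 0.1]),  # node 0: close to the probe
    np.array([1.0, 0.0, 0.5, 0.0]),  # node 1: far from the probe
]

label, errors = classify(probe, reconstructions)
print(label, errors, d_l1(reconstructions[0], probe))
```

The default threshold $t=0.4$ mirrors the ${d}_{.4}$ non-linearity referred to in the appendix figures; any other value can be passed explicitly.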

## 6. Experiments

#### 6.1. Results

#### 6.1.1. MNIST

#### 6.1.2. FashionMNIST

## 7. Discussion

#### 7.1. Investigation of the Bottleneck Role

#### 7.2. One-Class Manifold Learning for Separability

#### 7.3. Influence of Feature Selection and Link to the Rate of Compression

#### 7.3.1. Influence of the Parameter ${i}^{\star}$

#### 7.3.2. Influence of the Parameter K

- $K=5$ achieves the smallest classification error at the central node,
- near $K=5$ the behavior is smooth, and $K=5$ remains optimal in terms of classification,
- $K=1$ leads to overfitting, as the table shows a drop in performance between the training and test datasets,
- for $K>5$, the table shows a drop in performance due to the non-separability of the rates of distortion between nodes.

## 8. Conclusions

**Shannon’s Rate-Distortion theory and IB principles:** The main novelty is that we introduce compression by partially suppressing and quantizing information in the latent representations of untrained feature extractors. We deliberately introduce higher reconstruction errors to increase the separability of the classes, as in Figure 4. We demonstrate that a central node that does not share any information about the classes, either for training or for classification, can achieve competitive classification performance in comparison to classical systems.

**Compression principle:** Following the IB principles, for two close classes one should learn only what makes these classes unique and compress the common data in the latent representation. ScatNet provides universality and interpretability of its representations. We only quantize the first channel, which corresponds to a blurred image of the probe and is the most common component of the dataset. Thus, we suppress much of the information in the first channels while retaining the last channels, which hold the high-frequency information of the probe and are unique to each class. This introduces more separability in the learned manifolds. Nevertheless, we should keep enough information to reconstruct the inliers accurately.
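The bottleneck described in this paragraph can be sketched as follows. This is an illustrative approximation under stated assumptions, not the authors' code: the feature tensor is random rather than a real scattering transform, the shapes `(81, 7, 7)` and `(5, 7, 7)` are chosen only for the example, the dictionary is assumed to have been computed beforehand by K-means on the node's training data, and the suppressed channels are simply dropped. The names `quantize_to_dictionary` and `bottleneck` are hypothetical.

```python
import numpy as np

def quantize_to_dictionary(channel, dictionary):
    """Replace a channel by its nearest centroid (squared Euclidean distance).

    dictionary has shape (K, H, W); channel has shape (H, W).
    """
    dists = ((dictionary - channel[None]) ** 2).sum(axis=(1, 2))
    idx = int(np.argmin(dists))
    return dictionary[idx], idx

def bottleneck(features, dictionary, i_star):
    """Quantize channel 0, drop channels 1..i_star-1, keep channels from i_star on."""
    low_freq, idx = quantize_to_dictionary(features[0], dictionary)
    kept = features[i_star:]
    return np.concatenate([low_freq[None], kept], axis=0), idx

rng = np.random.default_rng(0)
features = rng.random((81, 7, 7))   # stand-in for scattering features (assumed shape)
dictionary = rng.random((5, 7, 7))  # stand-in for a K = 5 K-means dictionary

z, centroid_index = bottleneck(features, dictionary, i_star=80)
print(z.shape, centroid_index)
```

With `i_star=80` only the quantized first channel and the last channel survive, which matches the "two extreme channels" selection described in the caption of Figure A1.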

**Choice of parameters J and ${i}^{\star}$:** For simple datasets like MNIST, we can suppress many ScatNet channels and still retain enough information to accurately reconstruct the inliers. For a more complex dataset, we should suppress less information (smaller ${i}^{\star}$). With more scattering features (larger J), one can maintain separability. The trade-off between the rate and the distortion is expressed in Equation (10) and could also be optimized to learn these parameters. One can use a recent framework [48] to estimate the mutual information between the channels of ScatNet and thereby choose which channels contain common information to be suppressed by quantization. This will be our future line of research for more complex datasets.

**Back to “matched filtering” based on auto-encoding**: We show that it is possible to match and even outperform more classical centralized deep-learning architectures implemented in a federated, decentralized model. The most widely adopted interpretation of the deep-learning-based classification paradigm is that it can capture and accurately approximate the decision boundaries between the classes in the multi-dimensional space. In return, it requires all data to be available in a common place to learn these boundaries. State-of-the-art distributed deep-learning classification systems mostly aim to optimize the rate of gradient exchange and to limit potential leakage at the training stage in the communication between the nodes and the centralized server. In contrast, we demonstrate in practice that our classifier can be trained in a completely distributed way, where each node has access only to the data of its own class, gradients are not shared, and the other classes are unknown. Thus, the decision boundaries between the classes cannot be learned as such. Instead, we learn and encode the manifold of each class and only test the closeness of the probe to this class at the testing stage. This conceptually links the proposed approach with the well-known signal-processing concept of matched filtering.

**Data-management advantages**: Another consequence of the proposed framework is the possibility of decentralizing the data in order to analyze and classify it. Such a method would allow the work of analyzing data to be partitioned between different independent servers. Each encoder–decoder pair can be trained independently with different training data, rendering large and possibly confidential data transfers unnecessary.

**Future work**: In future research, we aim to investigate the proposed framework on more complex datasets such as ImageNet [49], Indoor Scene Recognition [50] and Labeled Faces in the Wild [51]. The robustness of the proposed framework against adversarial attacks is an important open question for future work, as is the study of unbalanced decentralized systems, where some classes could come from similar distributions or where nodes could own different proportions of the training data.

## Author Contributions

## Funding

## Conflicts of Interest

## Abbreviations

## Appendix A. Bottleneck Interpretation

**Figure A1.** For 10 MNIST samples: (1) their encoded representation, which is a scattering transform of depth J = 2, and (2) the first step of compression used for the bottleneck of our experimental setup, namely the selection of only the two extreme channels inside the violet frames. These two steps are deterministic and identical for each node. The second step of compression consists of quantizing the channel of depth 0 with one of the dictionaries described in Figure A2, depending on the node where the process occurs.

**Figure A2.** MNIST dictionaries for nodes used for the quantization of the first channel of the scattering transform. This first channel is a reduced version of the image obtained by a Gaussian blurring. The dictionary for one node consists of the centroids resulting from K-means applied to the training data of a node, with K = 5.

## Appendix B. Nodes Analysis

#### Appendix B.1. One-Class Manifold Learning

**Figure A3.** Description of the MNIST manifolds by tSNE in terms of inliers and outliers, for raw data and for reconstruction errors $RE$ with the classifying non-linearity ${d}_{.4}$. Figure 7 shows that inliers have visually quite good reconstructions, whereas the reconstructions of outliers are visually different from their raw versions. This visual separability is confirmed by the modified inlier and outlier manifolds for each node. For the raw data, the parameters are perplexity P = 50, learning rate LR = 200 and number of iterations IT = 1000. For the reconstructed data, the parameters are given in the detailed Figure A4 and Figure A5.

**Figure A4.** Manifold learning of our setup: for each node from label 0 to label 4, tSNEs show the manifolds for MNIST data at different stages of the proposed local auto-encoders. Inlier samples (local data used for training, of the same class as the node label) are in orange and outlier samples (data unseen during training that originates from a different class) are in blue. First column: the output of the bottleneck; second column: the reconstructed samples at the output of the decoders; third column: the reconstruction errors with respect to the original samples; last column: the reconstruction errors after application of the non-linearity metric ${d}_{.4}$, which classifies best at the central node. P stands for perplexity, LR for learning rate, and IT for the number of iterations. We can see that each step plays a role in the local manifold learning and in the power to separate inliers from outliers.

**Figure A5.** Manifold learning of our setup: for each node from label 5 to label 9, tSNEs show the manifolds for MNIST data at different stages of the proposed local auto-encoders. Inlier samples (local data used for training, of the same class as the node label) are in orange and outlier samples (data unseen during training that originates from a different class) are in blue. First column: the output of the bottleneck; second column: the reconstructed samples at the output of the decoders; third column: the reconstruction errors with respect to the original samples; last column: the reconstruction errors after application of the non-linearity metric ${d}_{.4}$, which classifies best at the central node. P stands for perplexity, LR for learning rate, and IT for the number of iterations. We can see that each step plays a role in the local manifold learning and in the power to separate inliers from outliers.

#### Appendix B.2. Influence of the Rate

**Figure A6.** Rate-distortion curves on MNIST distributions for each local node. The distortion on the y-axis is measured by the ${\ell}_{1}$ norm of the reconstruction error, and the rate on the x-axis is represented by the value of ${i}^{\star}$, the scattering channel index from which the information is kept at the compression stage, as described in Section 5.1.2; the greater ${i}^{\star}$ is, the more compression occurs. The blue curves are for inliers (samples with the same label as the one used to train the node) and the orange curves are for outliers. For each node, the separability is highest when ${i}^{\star}=80$, which corresponds to the final setup presented in Section 6.1. These curves correspond to the theoretical ones presented in Figure 4, and ${i}^{\star}=80$ corresponds to the optimal ${R}_{Q}$ for the best classification achieved. We note that for some nodes, when ${i}^{\star}=41$, the distortion is larger for inliers than for outliers, which may be caused by the structure of the ScatNet representation. This also happens for node 8 when ${i}^{\star}\in \{1,9,41\}$, which means that for these rates, outliers are better reconstructed by node 8 than samples of class 8. This problem has also been commented on in [21], where a classical outlier detector trained on class 8 of MNIST perfectly reconstructs outliers. The authors of [21] bypass this problem with some structural tricks and regularization of the bottleneck of their model, which actually increase the rate of compression. In our setup, when ${i}^{\star}=80$, the node for label 8 reconstructs inliers better than outliers and can separate them.

## References

- Delalleau, O.; Bengio, Y. Parallel Stochastic Gradient Descent; CIAR Summer School: Toronto, ON, Canada, 2007.
- Tian, L.; Jayaraman, B.; Gu, Q.; Evans, D. Aggregating Private Sparse Learning Models Using Multi-Party Computation. In Proceedings of the Private MultiParty Machine Learning (NIPS 2016 Workshop), Barcelona, Spain, 8 December 2016.
- McMahan, H.B.; Moore, E.; Ramage, D.; y Arcas, B.A. Federated Learning of Deep Networks using Model Averaging. arXiv **2016**, arXiv:1602.05629.
- McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, Fort Lauderdale, FL, USA, 20–22 April 2017.
- Oyallon, E.; Belilovsky, E.; Zagoruyko, S. Scaling the Scattering Transform: Deep Hybrid Networks. arXiv **2017**, arXiv:1703.08961.
- Tishby, N.; Zaslavsky, N. Deep Learning and the Information Bottleneck Principle. In Proceedings of the 2015 IEEE Information Theory Workshop (ITW), Jerusalem, Israel, 26 April–1 May 2015.
- Voloshynovskiy, S.; Kondah, M.; Rezaeifar, S.; Taran, O.; Holotyak, T.; Rezende, D.J. Information bottleneck through variational glasses. In Proceedings of the Bayesian Deep Learning (NeurIPS 2019 Workshop), Vancouver, BC, Canada, 13 December 2019.
- Gibiansky, A. Bringing HPC Techniques to Deep Learning. Available online: https://andrew.gibiansky.com/blog/machine-learning/baidu-allreduce/ (accessed on 28 October 2020).
- You, S.; Xu, C.; Xu, C.; Tao, D. Learning from multiple teacher networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 1285–1294.
- Lin, T.; Kong, L.; Stich, S.U.; Jaggi, M. Ensemble Distillation for Robust Model Fusion in Federated Learning. arXiv **2020**, arXiv:2006.07242.
- Asad, M.; Moustafa, A.; Ito, T.; Aslam, M. Evaluating the Communication Efficiency in Federated Learning Algorithms. arXiv **2020**, arXiv:2004.02738.
- Zhao, Y.; Li, M.; Lai, L.; Suda, N.; Civin, D.; Chandra, V. Federated Learning with Non-IID Data. arXiv **2018**, arXiv:1806.00582.
- Hsieh, K.; Phanishayee, A.; Mutlu, O.; Gibbons, P.B. The Non-IID Data Quagmire of Decentralized Machine Learning. arXiv **2019**, arXiv:1910.00189.
- Fung, C.; Yoon, C.J.M.; Beschastnikh, I. Mitigating Sybils in Federated Learning Poisoning. arXiv **2018**, arXiv:1808.04866.
- Deng, L. The MNIST Database of Handwritten Digit Images for Machine Learning Research. IEEE Signal Process. Mag. **2012**, 29, 141–142.
- Kingma, D.P.; Rezende, D.J.; Mohamed, S.; Welling, M. Semi-Supervised Learning with Deep Generative Models. arXiv **2014**, arXiv:1406.5298.
- Gordon, J.; Hernández-Lobato, J.M. Bayesian Semisupervised Learning with Deep Generative Models. arXiv **2017**, arXiv:1706.09751.
- Tax, D.M.; Duin, R.P. Support vector data description. Mach. Learn. **2004**, 54, 45–66.
- Sabokrou, M.; Khalooei, M.; Fathy, M.; Adeli, E. Adversarially Learned One-Class Classifier for Novelty Detection. arXiv **2018**, arXiv:1802.09088.
- Pidhorskyi, S.; Almohsen, R.; Adjeroh, D.A.; Doretto, G. Generative Probabilistic Novelty Detection with Adversarial Autoencoders. arXiv **2018**, arXiv:1807.02588.
- Perera, P.; Nallapati, R.; Xiang, B. OCGAN: One-class Novelty Detection Using GANs with Constrained Latent Representations. arXiv **2019**, arXiv:1903.08550.
- Dewdney, P.; Turner, W.; Braun, R.; Santander-Vela, J.; Waterson, M.; Tan, G.H. SKA1 System Baseline v2 Description; SKA Organisation: Macclesfield, UK, 2015.
- Alemi, A.A.; Fischer, I.; Dillon, J.V.; Murphy, K. Deep variational information bottleneck. arXiv **2016**, arXiv:1612.00410.
- Estella-Aguerri, I.; Zaidi, A. Distributed variational representation learning. IEEE Trans. Pattern Anal. Mach. Intell. **2019**, 1.
- Razeghi, B.; Stanko, T.; Škorić, B.; Voloshynovskiy, S. Single-Component Privacy Guarantees in Helper Data Systems and Sparse Coding with Ambiguation. In Proceedings of the IEEE International Workshop on Information Forensics and Security (WIFS), Delft, The Netherlands, 9–12 December 2019.
- Chen, Y.; Sun, X.; Jin, Y. Communication-Efficient Federated Deep Learning with Asynchronous Model Update and Temporally Weighted Aggregation. arXiv **2019**, arXiv:1903.07424.
- Hinton, G.E.; Salakhutdinov, R.R. Reducing the Dimensionality of Data with Neural Networks. Science **2006**, 313, 504–507.
- Bruna, J.; Mallat, S. Invariant scattering convolution networks. IEEE Trans. Pattern Anal. Mach. Intell. **2013**, 35, 1872–1886.
- Rezaeifar, S.; Taran, O.; Voloshynovskiy, S. Classification by Re-generation: Towards Classification Based on Variational Inference. arXiv **2018**, arXiv:1809.03259.
- Cover, T.M.; Thomas, J.A. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing); Wiley-Interscience: Hoboken, NJ, USA, 2006.
- Zhang, Y.; Ozay, M.; Sun, Z.; Okatani, T. Information Potential Auto-Encoders. arXiv **2017**, arXiv:1706.04635.
- Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv **2013**, arXiv:1312.6114.
- Mallat, S. Group invariant scattering. Commun. Pure Appl. Math. **2012**, 65, 1331–1398.
- Bernstein, S.; Bouchot, J.L.; Reinhardt, M.; Heise, B. Generalized analytic signals in image processing: Comparison, theory and applications. In Quaternion and Clifford Fourier Transforms and Wavelets; Birkhäuser: Basel, Switzerland, 2013; pp. 221–246.
- Zhao, H.; Gallo, O.; Frosio, I.; Kautz, J. Is L2 a Good Loss Function for Neural Networks for Image Processing. arXiv **2015**, arXiv:1511.08861.
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv **2014**, arXiv:1412.6980.
- Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. arXiv **2016**, arXiv:1609.04802.
- Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv **2017**, arXiv:1708.07747.
- Byerly, A.; Kalganova, T.; Dear, I. A Branching and Merging Convolutional Network with Homogeneous Filter Capsules. arXiv **2020**, arXiv:2001.09136.
- Hirata, D.; Takahashi, N. Ensemble learning in CNN augmented with fully connected subnetworks. arXiv **2020**, arXiv:2003.08562.
- Kowsari, K.; Heidarysafa, M.; Brown, D.E.; Meimandi, K.J.; Barnes, L.E. RMDL: Random Multimodel Deep Learning for Classification. arXiv **2018**, arXiv:1805.01890.
- Harris, E.; Marcu, A.; Painter, M.; Niranjan, M.; Prügel-Bennett, A.; Hare, J. FMix: Enhancing Mixed Sample Data Augmentation. arXiv **2020**, arXiv:2002.12047.
- Bhatnagar, S.; Ghosal, D.; Kolekar, M.H. Classification of fashion article images using convolutional neural networks. In Proceedings of the 2017 Fourth International Conference on Image Information Processing (ICIIP), Shimla, India, 21–23 December 2017.
- Hao, W.; Mehta, N.; Liang, K.J.; Cheng, P.; El-Khamy, M.; Carin, L. WAFFLe: Weight Anonymized Factorization for Federated Learning. arXiv **2020**, arXiv:2008.05687.
- Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv **2015**, arXiv:1502.03167.
- Hahnloser, R.H.R.; Sarpeshkar, R.; Mahowald, M.A.; Douglas, R.J.; Seung, H.S. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature **2000**, 405, 947–951.
- HasanPour, S.H.; Rouhani, M.; Fayyaz, M.; Sabokrou, M. Lets keep it simple, Using simple architectures to outperform deeper and more complex architectures. arXiv **2016**, arXiv:1608.06037.
- Belghazi, M.I.; Baratin, A.; Rajeswar, S.; Ozair, S.; Bengio, Y.; Courville, A.; Hjelm, R.D. MINE: Mutual Information Neural Estimation. arXiv **2018**, arXiv:1801.04062.
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
- Quattoni, A.; Torralba, A. Recognizing indoor scenes. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 413–420.
- Huang, G.B.; Ramesh, M.; Berg, T.; Learned-Miller, E. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments; Technical Report 07-49; University of Massachusetts: Amherst, MA, USA, 2007.

**Figure 1.** Theoretical and practical differences between centralized and decentralized training: (**a**) for the centralized training, the model can access all available data and, therefore, learn decision boundaries between classes. It is usually a single supervised classifier. More generally, it can be decomposed into an encoder followed by a classifier: the data manifold is projected by E onto a constrained space to make the work of C simpler, as in [5]. In theoretical terms, this model is justified by the Information Bottleneck (IB) principle [6] described by the Markov chain above and corresponds to the IB for supervised models described in [7]; (**b**) for the fully decentralized training, we assume the scenario where each node has access to the training data of one class only. Unlike the centralized model, this model cannot learn the decision boundaries between classes. Each node follows the unsupervised IB model described in [7]. The nodes share the same E and D structure, but the parameters of the encoders ${\mathit{\varphi}}_{0},\cdots ,{\mathit{\varphi}}_{{N}_{\mathbf{m}}}$ and decoders ${\mathit{\theta}}_{0},\cdots ,{\mathit{\theta}}_{{N}_{\mathbf{m}}}$ are learned for each class individually, given by the data manifold of each class. At the classification stage, the nodes share only the reconstruction errors with the central node.

**Figure 2.** Classification setup analyzed in this paper. It shows which parameters of the system are under privacy protection and which are shared in the public domain. The hyperparameters, such as the learning rate and the number of epochs, are also in the public domain and are sent from the central node to all local nodes.

**Figure 3.** Conceptual difference between (**a**) the centralized classification and (**b**) the extremely decentralized ON-OC classification. Colors represent the manifolds of each learned class. (**a**) Centralized and Federated classification; (**b**) ON-OC classification.

**Figure 4.** At testing time, the probe $\mathbf{x}$ is sent from the central node to ${N}_{\mathbf{m}}$ compression-decompression local nodes, each trained on its own data (“TR.” denotes “trained”), e.g., ${N}_{\mathbf{m}}=3$ in this example. The results of decompression, expressed as the reconstruction errors ${e}_{1}$, ${e}_{2}$ and ${e}_{3}$, are sent back to the central node. The proposed distributed model classifies in favor of the smallest reconstruction error. The compression in each node is characterized by a compression rate ${R}_{Q}$, which is chosen such that the distortion distributions for mismatched classes are maximized with respect to the matched case.
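The decision rule of Figure 4 can be illustrated with a minimal sketch. It assumes that each local node is available as a callable returning its reconstruction of the probe and that the error is the ${\ell}_{1}$ norm of the difference. The names `classify`, `l1_error` and `make_template_node` are hypothetical, and the toy nodes below simply return a fixed template instead of running a trained encoder-decoder pair.

```python
from typing import Callable, List, Sequence, Tuple

Image = Sequence[float]  # flattened probe (assumed representation)
Node = Callable[[Image], List[float]]  # hypothetical compress-decompress node


def l1_error(x: Image, x_hat: Image) -> float:
    """l1 norm of the reconstruction error between the probe and a reconstruction."""
    return sum(abs(a - b) for a, b in zip(x, x_hat))


def classify(x: Image, nodes: Sequence[Node]) -> Tuple[int, List[float]]:
    """Send the probe to every node and return the index of the smallest error."""
    errors = [l1_error(x, node(x)) for node in nodes]  # only errors reach the central node
    label = min(range(len(errors)), key=errors.__getitem__)
    return label, errors


def make_template_node(template: Image) -> Node:
    """Toy stand-in for a trained node: always reconstructs its class template."""
    return lambda x: list(template)


# Three toy nodes with templates [0, 0], [1, 1] and [2, 2].
nodes = [make_template_node([float(m)] * 2) for m in range(3)]
label, errors = classify([0.9, 1.2], nodes)
print(label)  # 1
```

In this sketch the central node never sees the node parameters or the reconstructions, only the scalar errors, which mirrors the sharing constraint stated in the caption.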

**Figure 5.** Detailed architecture of the proposed model for local node compression: (**a**) the generation of a dictionary to quantize a portion of the scattering transform feature vector, (**b**) the processing chain for encoding, compression and regeneration.

**Figure 6.** Scattering representation, feature selection and compression used for the bottleneck of our experimental setup. This figure shows the encoded representation for two MNIST samples, with the scattering transformation of deepness $J=2$. The feature selection is represented by the violet frames at scattering deepnesses 0 and 2, which select only the two extreme channels of these representations. These two steps are deterministic and identical for each node. The second step of compression consists of quantizing the channel of deepness 0 with a node-dependent dictionary as shown in Figure A2.

**Figure 7.** (**a**,**b**): Examples of reconstructions and classification on class 3 samples from (**a**,**b**) FashionMNIST dataset. The first column is for the probe and the following columns are for the results of the local nodes. The first row is for the names of the local nodes, rows $\mathbf{A}$ are for the probes and their reconstructions, rows $\mathbf{B}$ are for spatial errors, whereas rows $\mathbf{C}$ count the number of votes for the corresponding node label given by the experimented classifying metrics among ${d}_{{\ell}_{1}}$, ${d}_{VGG}$ and ${\left\{{d}_{t}\right\}}_{t}$. One distance is incorrect in (**a**): ${d}_{0}$ votes for $m=2$. Nine distances are incorrect in (**b**): ${d}_{0},{d}_{.13},\cdots ,{d}_{.19}$ vote for $m=0$ and ${d}_{.9}$ votes for $m=6$. (**c**) TensorBoard of the converging training of the 10 local nodes.

**Figure 8.** Detailed presentation of the different steps of compression for the local node 6, defined in Section 5.1.2, from the scattering representation of an outlier with label 3 to its compressed representation.

**Figure 9.** tSNEs showing representations of the MNIST data manifolds at different steps of the classifying process: (**a**) for the raw data, (**b**) for their scattering representations as output of ScatNet with all deepnesses and coefficients shown in Figure 8, (**c**) after suppressing the 79 intermediate channels, when only the two blue framed channels of the scattering representation are kept as shown in Figure 8, and (**d**) after quantization of the first channel by node quantizers of the same class as the samples as shown in the latent space representation of Figure 8, and whose dictionaries are shown in Figure A2. P, LR and IT respectively stand for the perplexity, the learning rate and the number of iterations of the tSNEs.

**Figure 10.** tSNEs showing representations of inliers and outliers on MNIST data manifolds for the node of label 9, at different steps of the encoding-decoding process. Inliers are samples of label 9 and outliers are samples of the remaining labels: (**a**) for the raw data, (**b**) for the data in node 9 after ScatNet and compression as described in Section 5.1.2, (**c**) for the data in node 9 after reconstruction by the decoder, (**d**) for the error of reconstruction with the original samples as shown in row $\mathbf{B}$ of Figure 7a, and (**e**) for the error of reconstruction after application of the optimal thresholding with $t=0.4$ presented in Section 5.3 and Table 3. More results for all nodes are shown in Figure A3, Figure A4 and Figure A5. P, LR and IT respectively stand for the perplexity, the learning rate and the number of iterations of the tSNEs.

**Figure 11.** (**a**) presents how different rates of compression are achieved in order to obtain the rate-distortion curves (**b**) for the reconstructions at node 7, and (**c**) for the accuracies of classification at the central node. In (**a**), different rates are achieved by changing the index ${i}^{\star}$ defined in Section 5.1.2, before which the scattering channels are suppressed during compression. In (**b**), the distortion on the y-axis is measured by the ${\ell}_{1}$ norm of the reconstruction error, and the rate on the x-axis is represented by the value of ${i}^{\star}$, the scattering channel index from which the information is kept at the compression stage, as described in Section 5.1.2; the greater ${i}^{\star}$ is, the more compression occurs. The blue curve is for inliers (samples of label 7 used to train the node) and the orange curve is for outliers. In (**c**), the classification accuracies are given on the training and testing datasets for different rates represented by ${i}^{\star}$. The number of centroids for the “0”-subband of ScatNet was fixed to $K=5$. (**a**) Index ${i}^{\star}$ and rate of compression; (**b**) Rate-distortion curve for the local node of label 7 on MNIST data; (**c**) Classification error (lower is better) on MNIST data for the central node.
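The role of the index ${i}^{\star}$ in Figure 11a can be sketched as a simple channel selection. This is a hedged reading of the captions of Figures 6, 9 and 11, not the authors' code: it assumes 81 scattering channels indexed from 0 to 80 (the $J=2$ case), that channel 0 is always kept (it is the one that is later quantized), and that the channels with indices from ${i}^{\star}$ upward are kept while those in between are suppressed. The helper names `kept_channels` and `kept_fraction` are hypothetical, and the kept fraction is only a rough proxy for the rate.

```python
from typing import List


def kept_channels(i_star: int, n_channels: int = 81) -> List[int]:
    """Indices of the scattering channels that survive compression.

    Assumption: channel 0 is always kept and channels 1 .. i_star - 1 are suppressed.
    """
    if not 1 <= i_star <= n_channels:
        raise ValueError("i_star must lie in [1, n_channels]")
    return [0] + list(range(i_star, n_channels))


def kept_fraction(i_star: int, n_channels: int = 81) -> float:
    """Fraction of channels kept; it decreases as i_star grows (more compression)."""
    return len(kept_channels(i_star, n_channels)) / n_channels


print(kept_channels(80))  # [0, 80]
```

With ${i}^{\star}=80$, this selection keeps the two extreme channels and suppresses the 79 intermediate ones, which agrees with the description given for Figure 9c.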

**Table 1.** The number of paths with growing scales up to the deepness $J=3$. The parameters ${j}_{d},{\alpha}_{d}$ at each deepness of a given path are constrained by $0\le \frac{\alpha}{2\pi}L<L$ for the rotations and $1\le {j}_{d-1}<{j}_{d}<{j}_{d+1}\le J$ for the scales. ${N}_{{S}^{J}}$ is the total number of scattering feature channels for deepness J, H is the height and W the width of $\mathbf{x}$. These values are for gray-scaled images ($\times 3$ for RGB pictures).

Scattering Features for One Given Path by Growing Deepness | Number of Channels | ${\mathit{S}}^{2}\left(\mathit{x}\right)$ $(\mathit{J}=2)$ | ${\mathit{S}}^{3}\left(\mathit{x}\right)$ $(\mathit{J}=3)$ | Tensor Sizes | ${\mathit{S}}^{2}\left(\mathit{x}\right)$ $(\mathit{J}=2)$ | ${\mathit{S}}^{3}\left(\mathit{x}\right)$ $(\mathit{J}=3)$ |
---|---|---|---|---|---|---|
$\mathbf{x}\star {\mathit{\varphi}}_{J}\left({2}^{J}u\right)$ | 1 | 1 | 1 | ${N}_{{S}^{J}}$ | 81 | 729 |
$\left\lvert \mathbf{x}\star {\psi}_{{j}_{1}}^{{\alpha}_{1}}\right\rvert \star {\mathit{\varphi}}_{J}\left({2}^{J}u\right)$ | $JL$ | 16 | 24 | Height | $H/4$ | $H/8$ |
$\left\lvert \left\lvert \mathbf{x}\star {\psi}_{{j}_{1}}^{{\alpha}_{1}}\right\rvert \star {\psi}_{{j}_{2}}^{{\alpha}_{2}}\right\rvert \star {\mathit{\varphi}}_{J}\left({2}^{J}u\right)$ | $\binom{J}{2}{L}^{2}$ | 64 | 192 | Width | $W/4$ | $W/8$ |
$\left\lvert \left\lvert \left\lvert \mathbf{x}\star {\psi}_{{j}_{1}}^{{\alpha}_{1}}\right\rvert \star {\psi}_{{j}_{2}}^{{\alpha}_{2}}\right\rvert \star {\psi}_{{j}_{3}}^{{\alpha}_{3}}\right\rvert \star {\mathit{\varphi}}_{J}\left({2}^{J}u\right)$ | $\binom{J}{3}{L}^{3}$ | 0 | 512 | | | |
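The channel counts of Table 1 can be checked with a short sketch. It assumes $L=8$ rotations, a value inferred from the table entries ($JL=16$ for $J=2$) rather than stated in this section, and uses the per-order count $\binom{J}{d}{L}^{d}$ from the second column. The function name `scattering_channels` is hypothetical.

```python
from math import comb
from typing import List


def scattering_channels(J: int, L: int, max_order: int = 3) -> List[int]:
    """Number of scattering channels per order d = 0 .. max_order.

    Order d contributes C(J, d) * L**d paths; math.comb returns 0 when d > J,
    which reproduces the 0 entry of the third-order row for J = 2.
    """
    return [comb(J, d) * L**d for d in range(max_order + 1)]


L = 8  # assumed number of rotations, inferred from JL = 16 at J = 2
for J in (2, 3):
    per_order = scattering_channels(J, L)
    # The total over all orders equals (1 + L)**J by the binomial theorem.
    print(J, per_order, sum(per_order))
# 2 [1, 16, 64, 0] 81
# 3 [1, 24, 192, 512] 729
```

The totals 81 and 729 match the ${N}_{{S}^{J}}$ column, and the closed form $(1+L)^{J}$ makes explicit how quickly the number of channels grows with the deepness.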

**Table 2.** The decoder ${D}_{\mathit{\theta}}$ batch-normalizes and convolves the compressed scattering representation $\mathbf{z}$; then it chains cycles of deconvolutions with batch-normalizations and ReLU activation functions until the probe size is recovered. The last activation function is the hyperbolic tangent. $c=1$ for gray-scaled images and $c=3$ for RGB images. The deepness J of the scattering encoding also determines the deepness of the decoder.

Stage | Number of Channels | Filter Size | Stride | Size Scale | Activation |
---|---|---|---|---|---|
input ${\mathbf{z}}_{m}={E}_{{\mathit{\varphi}}_{m}}\left(\mathbf{x}\right)$ | ${N}_{\mathbf{z}}$ | | | $1/{2}^{J}$ | |
Batch Normalization | | | | | |
Convolution | ${2}^{3(J+1)}c$ | $3\times 3$ | $1\times 1$ | | $ReLU$ |
Deconvolution | ${2}^{3J}c$ | $4\times 4$ | $2\times 2$ | $1/{2}^{J-1}$ | |
Batch Normalization | | | | | $ReLU$ |
Deconvolution | ${2}^{3(J-1)}c$ | $4\times 4$ | $2\times 2$ | $1/{2}^{J-2}$ | |
Batch Normalization | | | | | $ReLU$ |
⋮ | | | | | |
Deconvolution, output: $\widehat{\mathbf{x}}$ | $c$ | $4\times 4$ | $2\times 2$ | $1$ | $tanh$ |

**Table 3.** MNIST classification error on the training and testing datasets for our One Node–One Class–Information Bottleneck Classification (ON–OC–IBC) setup with different classifying metrics. It is compared with the state-of-the-art centralized methods BMCNN+HC [39], EnsNet [40] and RMDL [41], which are based on merging sub-networks or aggregating their sub-predictions by majority voting, and with the state-of-the-art Federated Averaging (FedAvg) on the IID and Non-IID setups given in [12]. The IID setup corresponds to 10 nodes, each with a uniform partition of the data of the 10 classes. The Non-IID result is given for a setup similar to ours, with 10 local nodes and one class of data per node; it differs from our setup in that gradients are shared across local nodes.

Method | BMCNN + HC (centralized) | EnsNet (centralized) | RMDL (centralized) | FedAvg, IID | FedAvg, Non-IID |
---|---|---|---|---|---|
Testing data error | $\mathbf{0.16}$ | $\mathbf{0.16}$ | $0.18$ | $1.43$ | $\mathbf{7.77}$ |

Proposed fully decentralized ON–OC–IBC:

Method | ${d}_{{\ell}_{1}}$ | ${d}_{VGG}$ | ${d}_{.2}$ | ${d}_{.3}$ | ${d}_{.4}$ | ${d}_{.5}$ | ${d}_{.6}$ | ${d}_{.7}$ |
---|---|---|---|---|---|---|---|---|
Training data error | $1.5$ | $\mathbf{0}$ | $3.1$ | $1.5$ | $\mathbf{0}$ | $\mathbf{0}$ | $\mathbf{0}$ | $1.5$ |
Testing data error | $4.6$ | $3.1$ | $1.5$ | $3.1$ | $\mathbf{0}$ | $4.6$ | $6.2$ | $7.8$ |

**Table 4.** FashionMNIST classification error on the testing dataset for the proposed ON–OC–IBC setup with different classifying metrics. It is compared with state-of-the-art centralized methods: RN18+FMix [42], a Mixed Sample Data Augmentation that uses binary masks obtained by applying a threshold to low-frequency images sampled from Fourier space, and the classical CNN, CNN++ and LSTM described in [43]. It is also compared with state-of-the-art Federated Learning methods: FedAvg and WAFFLe [44], the Weight Anonymized Factorization for Federated Learning, which combines the Indian Buffet Process with a shared dictionary of weight factors for neural networks. The results of these two methods are given for the Non-IID setup with only $Z=2$ data classes stored in each local node, either in a unimodal (Uni) way, with a 1:1 ratio of data from the two classes, or in a multimodal (Multi) way, with a 1:5 ratio of data in each local node. For the proposed ON–OC–IBC setup, only $Z=1$ data class is stored in each local node, so there is no data distribution ratio, and the number of local nodes is exactly the number of classes.

Method | RN18+FMix (centralized) | CNN (centralized) | CNN++ (centralized) | LSTM (centralized) | FedAvg, Uni | FedAvg, Multi | WAFFLe, Uni | WAFFLe, Multi |
---|---|---|---|---|---|---|---|---|
Testing data error | $\mathbf{3.64}$ | $8.83$ | $7.46$ | $11.74$ | $16.04$ | $16.57$ | $\mathbf{12.88}$ | $13.91$ |

Proposed fully decentralized ON–OC–IBC:

Method | ${d}_{{\ell}_{1}}$ | ${d}_{VGG}$ | ${d}_{.2}$ | ${d}_{.3}$ | ${d}_{.4}$ |
---|---|---|---|---|---|
Testing data error | $\mathbf{10.1}$ | $12.2$ | $12$ | $13.1$ | $14.4$ |

**Table 5.** Classification error on the MNIST train and test datasets for different values of the quantization parameter K, with ${i}^{\star}=80$ fixed. We use the notation $K=\infty $ when no quantization is performed on the first channel of the scattering latent representation. The classification metric is ${d}_{.4}$.

K | 1 | 4 | 5 | 6 | 15 | 20 | 50 | 100 | ∞ |
---|---|---|---|---|---|---|---|---|---|
on train (%) | $80.4$ | $19.1$ | $0$ | $19.0$ | $89.6$ | $91.1$ | $91.8$ | $91.1$ | $90.2$ |
on test (%) | $90.2$ | $23.9$ | $0$ | $24.4$ | $89.7$ | $91.2$ | $92.1$ | $91.2$ | $90.2$ |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Ullmann, D.; Rezaeifar, S.; Taran, O.; Holotyak, T.; Panos, B.; Voloshynovskiy, S.
Information Bottleneck Classification in Extremely Distributed Systems. *Entropy* **2020**, *22*, 1237.
https://doi.org/10.3390/e22111237
