# Mixture of Experts with Entropic Regularization for Data Classification

## Abstract

## 1. Introduction

## 2. Mixture-of-Experts

#### EM Algorithm for Mixture-of-Experts

**E Step:** The expected value of the assignment variable ${z}_{ni}$ is inferred by applying Bayes' theorem:
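As a hedged illustration, the standard mixture-of-experts formulation gives the posterior responsibility below. Apart from $z_{ni}$, the notation is assumed here and may not match the original equation: $g_i(x_n)$ denotes the gate output for expert $i$, $p(y_n \mid x_n, \theta_i)$ the likelihood of the label under expert $i$, and $K$ the number of experts.

$$
\langle z_{ni} \rangle = \frac{g_i(x_n)\, p(y_n \mid x_n, \theta_i)}{\sum_{j=1}^{K} g_j(x_n)\, p(y_n \mid x_n, \theta_j)}
$$

The numerator is the prior weight the gate gives to expert $i$ times how well that expert explains the observed label. The denominator normalizes over all experts, so the responsibilities for each instance sum to one.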

**M Step:** The expected complete log-likelihood function for the training data is defined as:
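A sketch of this objective in its standard textbook form is given below. The notation is assumed rather than taken from the original equation: $N$ is the number of training instances, $K$ the number of experts, $g_i(x_n)$ the gate output, $p(y_n \mid x_n, \theta_i)$ the likelihood under expert $i$, and $\langle z_{ni} \rangle$ the expected assignment from the E step.

$$
\langle \mathcal{L}_c \rangle = \sum_{n=1}^{N} \sum_{i=1}^{K} \langle z_{ni} \rangle \left[ \log g_i(x_n) + \log p(y_n \mid x_n, \theta_i) \right]
$$

Because the logarithm splits the product of gate and expert terms, the gate parameters and the expert parameters can be optimized separately in this form, each weighted by the responsibilities.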

## 3. Mixture-of-Experts with Entropic Regularization

## 4. Experiments

- Log-likelihood: we measured the value of the log-likelihood function for each iteration with both methods. Specifically, we analyzed the convergence of the algorithms.
- Average accuracy: we measured the prediction accuracy as the average of the results given by the cross-validation procedure. We analyzed these values for 10, 20, 30, 40, and 50 experts.
- Average entropy: we measured the average entropy of the gate network outputs. We used these values to visually analyze how the entropy behaves when an entropy term is incorporated into the cost function.
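The average-entropy measure in the list above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each gate output is a probability vector over the experts, uses Shannon entropy in nats, and the function names `gate_entropy` and `average_gate_entropy` are hypothetical.

```python
import math

def gate_entropy(probs):
    """Shannon entropy (in nats) of one gate output vector.

    Assumes `probs` is a probability vector over the experts.
    Zero entries contribute nothing, following the convention 0 * log(0) = 0.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def average_gate_entropy(gate_outputs):
    """Mean entropy over a batch of gate output vectors, one per instance."""
    if not gate_outputs:
        raise ValueError("gate_outputs must contain at least one row")
    return sum(gate_entropy(row) for row in gate_outputs) / len(gate_outputs)

# Two extreme cases with 4 experts (illustrative data, not from the paper):
uniform = [[0.25, 0.25, 0.25, 0.25]] * 3   # gate spreads weight evenly
one_hot = [[1.0, 0.0, 0.0, 0.0]] * 3       # gate picks a single expert

uniform_entropy = average_gate_entropy(uniform)   # equals log(4), the maximum
one_hot_entropy = average_gate_entropy(one_hot)   # equals 0, the minimum
```

A uniform gate reaches the maximum entropy $\log K$, while a winner-take-all gate gives zero, so the average indicates how many experts share each instance.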

#### 4.1. Datasets

#### 4.2. Log-Likelihood Analysis

#### 4.3. Accuracy Analysis

#### 4.4. Visual Analysis of Average Entropy of Gate Network Outputs

## 5. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest


**Figure 2.** Log-likelihood values with 20 experts for the classical MoE and the entropic MoE (EMoE) for all datasets. In these experiments, we mainly used 50 iterations.

**Figure 3.** Average entropy scores in the network gate outputs for the Ionosphere, Spectf, Sonar, and Musk datasets in the MoE and EMoE models with 10 experts.

**Figure 4.** Average entropy scores in the network gate outputs for the Arrhythmia, Secom, PIE10P, and Leukemia datasets in the MoE and EMoE models with 10 experts.

Dataset Name | Number of Instances | Dimensionality | Number of Classes |
---|---|---|---|
Ionosphere | 351 | 33 | 2 |
Spectf | 267 | 44 | 2 |
Sonar | 208 | 61 | 2 |
Musk-1 | 486 | 168 | 2 |
Arrhythmia | 452 | 279 | 16 |
Secom | 1567 | 471 | 2 |
PIE10P | 210 | 1000 | 10 |
Leukemia | 75 | 1500 | 2 |

**Table 2.** Summary of the best parameters found by the grid search procedure for each of the datasets analyzed and the number of experts.

Dataset | K = 10 | K = 20 | K = 30 | K = 40 | K = 50 |
---|---|---|---|---|---|
Ionosphere | −32 | −128 | −16 | −32 | −128 |
Spectf | 128 | 128 | −2 | −1.5 | 8 |
Sonar | −1 | −1 | 64 | −1.5 | −2 |
Musk | 32 | −32 | −32 | −16 | −16 |
Arrhythmia | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
Secom | 8 | 4 | 8 | 8 | 32 |
PIE10P | 128 | 128 | 128 | 128 | 128 |
Leukemia | 8 | 128 | 64 | 32 | 128 |

**Table 3.** Average classification accuracy (plus its standard deviation), using 30-fold stratified cross-validation for the classical MoE and EMoE. The accuracies are obtained considering a different number of experts K ($K=10,20,30,40,50$). The best result per dataset and number of experts is shown in bold.

Dataset | K = 10 (MoE) | K = 10 (EMoE) | K = 20 (MoE) | K = 20 (EMoE) | K = 30 (MoE) | K = 30 (EMoE) | K = 40 (MoE) | K = 40 (EMoE) | K = 50 (MoE) | K = 50 (EMoE) |
---|---|---|---|---|---|---|---|---|---|---|
Ionosphere | 85.1% (0.022) | 88.4% (0.015) | 87.9% (0.025) | 90.1% (0.023) | 86.9% (0.024) | 91.0% (0.025) | 87.3% (0.020) | 90.7% (0.023) | 87.6% (0.029) | 91.1% (0.026) |
Spectf | 70.6% (0.067) | 72.8% (0.073) | 72.7% (0.044) | 78.0% (0.127) | 68.0% (0.067) | 73.2% (0.155) | 71.0% (0.086) | 75.5% (0.075) | 72.5% (0.082) | 74.8% (0.093) |
Sonar | 67.5% (0.046) | 67.5% (0.040) | 67.2% (0.038) | 67.6% (0.047) | 69.2% (0.043) | 69.0% (0.041) | 69.24% (0.052) | 69.28% (0.047) | 67.5% (0.059) | 67.9% (0.059) |
Musk | 75.7% (0.031) | 75.8% (0.030) | 75.9% (0.024) | 76.1% (0.027) | 75.8% (0.022) | 76.1% (0.017) | 76.6% (0.033) | 76.7% (0.037) | 77.4% (0.034) | 77.2% (0.032) |
Arrhythmia | 48.2% (0.035) | 49.7% (0.033) | 51.3% (0.048) | 55.1% (0.063) | 48.3% (0.032) | 56.5% (0.058) | 49.8% (0.028) | 55.0% (0.063) | 50.3% (0.035) | 57.0% (0.038) |
Secom | 88.8% (0.012) | 92.1% (0.008) | 89.1% (0.010) | 92.2% (0.010) | 89.2% (0.014) | 92.3% (0.009) | 89.0% (0.012) | 92.4% (0.009) | 89.6% (0.012) | 92.7% (0.010) |
PIE10P | 100% (0) | 100% (0) | 99.96% (0.001) | 99.96% (0.001) | 100% (0) | 100% (0) | 100% (0) | 100% (0) | 100% (0) | 100% (0) |
Leukemia | 80.8% (0) | 80.8% (0) | 80.6% (0.001) | 80.5% (0.001) | 98.2% (0) | 98.2% (0) | 97.4% (0) | 97.4% (0) | 98.3% (0) | 98.3% (0) |
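The stratified cross-validation protocol named in the Table 3 caption can be sketched as below. This is an illustrative outline only. The helpers `stratified_folds`, `cross_validate`, and `majority_baseline` are hypothetical names, the round-robin fold assignment is one simple way to stratify, the population standard deviation is an assumed choice, and a majority-class predictor stands in for the MoE and EMoE models, which are not reproduced here.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev

def stratified_folds(labels, n_folds, seed=0):
    """Split instance indices into n_folds folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the indices of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[(offset + pos) % n_folds].append(idx)
        offset += len(indices)
    return folds

def cross_validate(labels, n_folds, fit_and_score):
    """Return (mean, population std) of the per-fold accuracies.

    `fit_and_score(train_idx, test_idx)` is a caller-supplied callback that
    trains on the training indices and returns accuracy on the test indices.
    """
    folds = stratified_folds(labels, n_folds)
    scores = []
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        scores.append(fit_and_score(train_idx, test_idx))
    return mean(scores), pstdev(scores)

def majority_baseline(labels):
    """Stand-in model: always predict the most frequent training label."""
    def fit_and_score(train_idx, test_idx):
        counts = defaultdict(int)
        for i in train_idx:
            counts[labels[i]] += 1
        predicted = max(counts, key=counts.get)
        return sum(labels[i] == predicted for i in test_idx) / len(test_idx)
    return fit_and_score

# Synthetic labels (not from the paper): 60 instances of class 0, 30 of class 1.
labels = [0] * 60 + [1] * 30
folds = stratified_folds(labels, 30)
mean_acc, std_acc = cross_validate(labels, 30, majority_baseline(labels))
```

With these synthetic labels every fold holds two instances of class 0 and one of class 1, so the stand-in model scores the same accuracy on each fold and the standard deviation is zero. A real experiment would pass a callback that trains and evaluates the classifier under study.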

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Peralta, B.; Saavedra, A.; Caro, L.; Soto, A. Mixture of Experts with Entropic Regularization for Data Classification. *Entropy* **2019**, *21*, 190.
https://doi.org/10.3390/e21020190
