# Revisiting Dropout: Escaping Pressure for Training Neural Networks with Multiple Costs


## Abstract


## 1. Introduction

## 2. Related Work

## 3. Cost-Out: Sub-Cost Dropout Inducing Escaping Pressure

#### 3.1. Motivation

#### 3.1.1. Performance Limit Caused by Multiple Sub-Costs

#### 3.1.2. Gradient Conflict between Sub-Costs

#### 3.2. Method: Stochastic Switching of Sub-Costs

#### 3.3. Estimation of Escaping Pressure

#### 3.4. Convergence of Cost-Out Compared to Other Optimization Methods

## 4. Experiments and Results

#### 4.1. Classifying Two-Digit Images Sampled from the Same Set (TDC-Same)

#### 4.1.1. Performance Recovery under Various Regularization Effects

#### 4.1.2. Relaxation of Gradient Conflict

#### 4.2. Classifying Two-Digit Images Sampled from the Two Disjoint Sets (TDC-Disjoint)

#### Performance Change by Deep Structuring

#### 4.3. Machine Translation with Hierarchical Softmax (MT-Hsoftmax)

#### Effect of Cost-Out in MT-hsoftmax

#### 4.4. Machine Translation Summing Costs of All Target Words (MT-Sum)

#### Effect of Cost-Out in MT-Sum

## 5. Discussion

## 6. Conclusions and Future Work

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest


**Figure 1.** Cost landscape for combined and decomposed costs. (**top**) Gradient conflict from the cost perspective. (**bottom**) Gradient conflict from the gradient perspective. (cost1: a normal distribution $N(0.3, 0.3)$; cost2: $N(-0.3, 0.3)$; total cost: sum of cost1 and cost2; max: sum of optima of cost1 and cost2; costout: the expected gradient calculated in Equation (3).)
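As a rough illustration of the setting described in the Figure 1 caption, the sketch below evaluates two Gaussian-shaped sub-costs centered at 0.3 and -0.3, their sum, and the points where the two sub-cost gradients disagree in sign. It assumes that the second parameter of $N(\cdot, 0.3)$ is a standard deviation, which the caption does not state. All function names are hypothetical, and the Cost-Out curve of Equation (3) is not reproduced here.

```python
import math

SIGMA = 0.3  # assumption: the caption's second parameter is a standard deviation


def gaussian(x, mu, sigma=SIGMA):
    """Density of a normal distribution with mean mu and std sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def gaussian_grad(x, mu, sigma=SIGMA):
    """Derivative of the density with respect to x."""
    return -(x - mu) / (sigma ** 2) * gaussian(x, mu, sigma)


def total_cost(x):
    """Sum of the two sub-costs (cost1 centered at 0.3, cost2 at -0.3)."""
    return gaussian(x, 0.3) + gaussian(x, -0.3)


def total_grad(x):
    """Gradient of the summed cost."""
    return gaussian_grad(x, 0.3) + gaussian_grad(x, -0.3)


def conflict(x):
    """True when the two sub-cost gradients point in opposite directions."""
    return gaussian_grad(x, 0.3) * gaussian_grad(x, -0.3) < 0


if __name__ == "__main__":
    for x in (-0.5, -0.15, 0.0, 0.15, 0.5):
        print(f"x={x:+.2f} total={total_cost(x):.3f} conflict={conflict(x)}")
```

Under these assumptions the two gradients conflict everywhere between the two centers (for example at `x = 0.0`, where they cancel in the sum), and agree in sign outside that interval (for example at `x = 0.5`).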

**Figure 2.** A schematic illustration of the Cost-Out mechanism. (**left**) A typical multi-task learning scheme learns all sub-costs at each update. (**right**) Cost-Out randomly drops sub-costs with a given probability and learns only the remaining sub-costs at each update. Dotted and solid lines indicate dropped and remaining sub-costs, respectively.
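The mechanism in the Figure 2 caption can be sketched as a random keep/drop mask over sub-costs that is resampled at every update. This is a minimal sketch and not the authors' implementation: the function names are hypothetical, the rule that at least one sub-cost is always kept is an added assumption, and the soft and hard variants named elsewhere in the paper are not distinguished.

```python
import random


def cost_out_mask(num_sub_costs, drop_prob, rng):
    """Sample one flag per sub-cost; True means the sub-cost is kept."""
    mask = [rng.random() >= drop_prob for _ in range(num_sub_costs)]
    if not any(mask):
        # Assumption: never drop every sub-cost, so each update has a signal.
        mask[rng.randrange(num_sub_costs)] = True
    return mask


def cost_out_gradient(sub_grads, mask):
    """Sum the gradients of the remaining (kept) sub-costs only."""
    return sum(g for g, keep in zip(sub_grads, mask) if keep)


def full_gradient(sub_grads):
    """Typical multi-task update: every sub-cost contributes."""
    return sum(sub_grads)


if __name__ == "__main__":
    rng = random.Random(0)
    sub_grads = [1.0, -2.0]  # illustrative, conflicting sub-cost gradients
    for step in range(5):
        mask = cost_out_mask(len(sub_grads), drop_prob=0.5, rng=rng)
        print(step, mask, cost_out_gradient(sub_grads, mask))
```

With conflicting gradients such as `[1.0, -2.0]`, the full update always follows their sum, whereas a masked update sometimes follows a single sub-cost on its own.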

**Figure 3.** Comparison of convergence patterns between SGD + Cost-Out and other optimization methods. While there is only one optimum for Adam and AdaDelta, three possible optima (i.e., points where the gradient is 0) exist for Cost-Out: left ($o_l$), center ($o_c$), and right ($o_r$).

**Figure 4.** Achievable best performance of Cost-Out under $L_2$ regularization and batch normalization. (sCO: soft Cost-Out; hCO: hard Cost-Out; DO: dropout; noCO: without Cost-Out.)

**Figure 5.** The change in precision-gradient scale in the training phase of the TDC-same task. (sCO: soft Cost-Out; hCO: hard Cost-Out; noCO: without Cost-Out.)

**Figure 6.** Best precision averaged over five models by the number of stacked layers, with or without Cost-Out, in the TDC-disjoint task.

**Table 1.** Preliminary digit-image classification experiments using an MLP trained on the MNIST dataset, verifying that MLP accuracy decreases as the task is extended. (single input: 28 × 28; dual input: 28 × 56; single output: 1 digit; dual output: 2 digits.)

Input Image | Output Class | Test Precision (%) |
---|---|---|
single | single | 97.99 |
dual | single | 97.57 |
dual | dual | 97.17 |

Gradient Sign | Near $o_l$, Left | Near $o_l$, Right | Near $o_c$, Left | Near $o_c$, Right | Near $o_r$, Left | Near $o_r$, Right |
---|---|---|---|---|---|---|
w/ Cost-Out | + | − | − | + | + | − |
w/o Cost-Out | + | + | + | − | − | − |

Convergence to Optimum | Near $o_l$ | Near $o_c$ | Near $o_r$ |
---|---|---|---|
w/ Cost-Out | converge | diverge | converge |
w/o Cost-Out | pass | converge | pass |
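The convergence rows follow from the gradient signs once an update direction is fixed. The sketch below assumes that the parameter moves in the direction of the gradient sign (ascent toward the peaks of the landscape), which is the reading under which the sign rows and the convergence rows of the table agree. The helper `classify` and the dictionary layout are hypothetical.

```python
def classify(left_sign, right_sign):
    """Classify behavior near a point from the gradient sign on each side.

    Assumes the parameter moves in the direction of the gradient sign:
    "+" pushes it to the right, "-" pushes it to the left.
    """
    if left_sign == "+" and right_sign == "-":
        return "converge"  # both sides push toward the point
    if left_sign == "-" and right_sign == "+":
        return "diverge"   # both sides push away from the point
    return "pass"          # same sign on both sides: the point is crossed


# Gradient signs (left, right) copied from the table above.
SIGNS = {
    "w/ Cost-Out": {"o_l": ("+", "-"), "o_c": ("-", "+"), "o_r": ("+", "-")},
    "w/o Cost-Out": {"o_l": ("+", "+"), "o_c": ("+", "-"), "o_r": ("-", "-")},
}

behavior = {
    method: {point: classify(*signs) for point, signs in points.items()}
    for method, points in SIGNS.items()
}

if __name__ == "__main__":
    for method, result in behavior.items():
        print(method, result)
```

Read this way, Cost-Out turns the single central attractor into a repelling point and makes the two side optima attracting, while training without Cost-Out passes through the side points and settles at the center.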

Hyper-Parameter | Value |
---|---|
Dropout Probability | $0.1, 0.5, 0.9$ |
Batch Normalization | decaying |
· Decaying Rate | 0.99 per epoch |
$L_2$ Scale | $0, 10^{-5}, 5\times 10^{-5}, 10^{-4}, 5\times 10^{-4}, 10^{-3}$ |
Cost-Out Type | soft, hard |
Optimizer | SGD, Adam |
· SGD Learning Rate | $10^{-2}, 10^{-3}$ |
· Adam Learning Rate | $10^{-4}$ |
· Adam ($\beta_1$, $\beta_2$) | $(0.9, 0.999)$ |
Hidden Layer Size | $512, 2048$ |
Batch Size | 32 |
Activation | tanh |

**Table 4.** Best performance on the test set of the TDC-same task. Common best configuration: Adam, no dropout, 2048 hidden nodes, $10^{-4}$ $L_2$ scale. (P.: precision; p: $\text{recovery rate} = \frac{\text{error by multiple sub-costs}}{\text{error after applying Cost-Out}}$; BN: batch normalization, o: used, x: not used.)

Input | Output | Method | Best P. | p | BN |
---|---|---|---|---|---|
dual | single | ensemble | 97.96 | – | o |
dual | dual | without Cost-Out | 97.84 | 0.00 | x |
dual | dual | with hard Cost-Out | 97.87 | 0.21 | x |
dual | dual | with soft Cost-Out | 97.88 | 0.29 | x |

**Table 5.** Detailed performance change by applying Cost-Out or dropout in TDC-same with various settings. (d: dimension of the hidden layer; y-axis: average precision; x-axis: scale of the $L_2$ penalty.)

Plot panels (images not reproduced): SGD, Adam, SGD + batch norm, and Adam + batch norm, each with d = 512 and d = 2048.

Hyper-Parameter | | Parameter Size | |
---|---|---|---|
LSTM Stacks | 4 | Encoder | 3.05 M |
Cells per Stack | 1000 | Decoder | 3.10 M |
Dim. of Word | 50 | Output | 11 M |
Dim. of Attention | 250 | Interface | 0.19 M |
Batch Size | 128 | | |

**Table 7.** Performance change by using Cost-Out in MT-hsoftmax. ($CO$: Cost-Out; $S-T$: source-target; $P.$: precision; $B.$: BLEU; $w$ and $w/o$: with and without Cost-Out; $\delta$: performance gain when using Cost-Out (score of $w$ − score of $w/o$); $\mu(\delta)$: mean of $\delta$; $\sigma(\delta)$: standard deviation of $\delta$.) Columns labeled Best report the best performance in each set; columns labeled Selected report the performance of the selected model.

$CO$ | $S-T$ | Best Valid $P.$ | Best Valid $B.$ | Best Test1 $P.$ | Best Test1 $B.$ | Best Test2 $P.$ | Best Test2 $B.$ | Selected Train $P.$ | Selected Test1 $P.$ | Selected Test1 $B.$ | Selected Test2 $P.$ | Selected Test2 $B.$ |
---|---|---|---|---|---|---|---|---|---|---|---|---|
$w$ | En-Fr | 15.59 | 20.23 | 15.05 | 16.40 | 12.33 | 16.84 | 57.43 | 14.81 | 16.26 | 12.33 | 16.65 |
$w/o$ | En-Fr | 15.20 | 19.86 | 14.73 | 15.65 | 12.06 | 16.15 | 57.41 | 14.64 | 15.65 | 11.97 | 16.15 |
$w$ | Fr-En | 19.99 | 27.25 | 13.35 | 12.34 | 12.76 | 14.58 | 56.41 | 13.04 | 12.34 | 12.64 | 14.58 |
$w/o$ | Fr-En | 18.48 | 26.08 | 12.16 | 11.23 | 11.52 | 13.52 | 55.04 | 12.16 | 10.67 | 11.52 | 13.26 |
$w$ | En-Es | 16.07 | 16.50 | 14.75 | 13.38 | 14.50 | 16.27 | 48.84 | 14.75 | 12.37 | 14.50 | 15.05 |
$w/o$ | En-Es | 16.14 | 16.24 | 14.71 | 13.33 | 14.81 | 16.23 | 47.03 | 14.71 | 12.43 | 14.76 | 14.80 |
$w$ | Es-En | 17.87 | 17.68 | 15.79 | 13.65 | 15.24 | 15.60 | 23.48 | 15.79 | 13.65 | 15.24 | 15.60 |
$w/o$ | Es-En | 17.21 | 17.33 | 15.27 | 13.21 | 14.72 | 15.43 | 20.70 | 15.27 | 13.21 | 14.72 | 15.43 |
$w$ | En-De | 15.57 | 10.12 | 13.31 | 6.51 | 10.77 | 5.78 | 41.12 | 13.00 | 5.25 | 10.59 | 4.67 |
$w/o$ | En-De | 15.62 | 10.05 | 13.47 | 6.50 | 10.66 | 5.72 | 43.00 | 13.47 | 6.24 | 10.66 | 5.21 |
$w$ | De-En | 14.60 | 12.93 | 13.24 | 8.79 | 10.83 | 8.29 | 45.41 | 13.24 | 8.79 | 10.83 | 8.29 |
$w/o$ | De-En | 14.54 | 12.60 | 13.19 | 8.34 | 10.67 | 7.89 | 47.02 | 12.93 | 8.00 | 10.67 | 7.81 |
$\mu(\delta)$ | | 0.42 | 0.43 | 0.33 | 0.47 | 0.33 | 0.40 | 0.42 | 0.24 | 0.41 | 0.30 | 0.36 |
$\sigma(\delta)$ | | 0.61 | 0.38 | 0.49 | 0.42 | 0.52 | 0.40 | 1.90 | 0.46 | 0.89 | 0.49 | 0.60 |
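The summary rows can be checked against the per-pair scores. The short example below recomputes $\mu(\delta)$ and $\sigma(\delta)$ for the first score column of Table 7 (validation-set precision). The caption does not say which standard deviation convention is used; the sample standard deviation ($n-1$ in the denominator) is assumed here because it reproduces the reported value, whereas the population form gives about 0.55.

```python
import statistics

# Validation-set precision (P.) from Table 7, one entry per language pair:
# En-Fr, Fr-En, En-Es, Es-En, En-De, De-En.
with_co = [15.59, 19.99, 16.07, 17.87, 15.57, 14.60]
without_co = [15.20, 18.48, 16.14, 17.21, 15.62, 14.54]

# delta: score with Cost-Out minus score without Cost-Out.
delta = [w - wo for w, wo in zip(with_co, without_co)]

mu = statistics.mean(delta)
sigma = statistics.stdev(delta)  # assumption: sample standard deviation (n - 1)

print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")  # prints: mu = 0.42, sigma = 0.61
```

The standard deviation is larger than the mean in this column, so the average gain of 0.42 is not uniform across language pairs: two of the six differences are slightly negative.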

Columns labeled Best report the best performance in each set; columns labeled Selected report the performance of the selected model.

$CO$ | $S-T$ | Best Valid $P.$ | Best Valid $B.$ | Best Test1 $P.$ | Best Test1 $B.$ | Best Test2 $P.$ | Best Test2 $B.$ | Selected Train $P.$ | Selected Test1 $P.$ | Selected Test1 $B.$ | Selected Test2 $P.$ | Selected Test2 $B.$ |
---|---|---|---|---|---|---|---|---|---|---|---|---|
$w$ | En-Fr | 19.43 | 26.60 | 18.92 | 22.52 | 15.73 | 22.88 | 65.30 | 18.59 | 22.28 | 15.51 | 22.77 |
$w/o$ | En-Fr | 20.48 | 31.74 | 13.99 | 15.69 | 13.72 | 18.77 | 68.46 | 13.99 | 15.69 | 13.61 | 18.77 |
$w$ | Fr-En | 21.45 | 32.11 | 14.15 | 15.54 | 13.92 | 18.98 | 70.41 | 14.15 | 15.54 | 13.92 | 18.98 |
$w/o$ | Fr-En | 20.21 | 31.44 | 13.27 | 15.08 | 13.36 | 18.64 | 69.94 | 13.22 | 15.05 | 13.36 | 18.64 |
$w$ | En-Es | 19.98 | 23.24 | 17.35 | 19.87 | 18.77 | 24.77 | 59.50 | 17.31 | 19.87 | 18.77 | 24.77 |
$w/o$ | En-Es | 19.71 | 22.73 | 17.20 | 19.48 | 18.53 | 24.49 | 57.37 | 17.07 | 19.26 | 18.53 | 24.49 |
$w$ | Es-En | 21.95 | 24.62 | 18.76 | 19.52 | 19.66 | 24.24 | 59.36 | 18.53 | 19.27 | 19.66 | 24.24 |
$w/o$ | Es-En | 21.73 | 24.67 | 18.38 | 19.73 | 19.45 | 24.29 | 59.69 | 18.21 | 19.50 | 19.45 | 24.29 |
$w$ | En-De | 18.28 | 14.47 | 15.46 | 10.46 | 12.52 | 9.54 | 51.18 | 15.26 | 10.39 | 12.51 | 9.40 |
$w/o$ | En-De | 18.24 | 14.52 | 15.60 | 10.65 | 13.01 | 10.09 | 48.96 | 15.40 | 10.65 | 12.58 | 9.45 |
$w$ | De-En | 17.30 | 17.74 | 15.53 | 13.63 | 13.27 | 13.29 | 50.07 | 15.39 | 13.48 | 13.27 | 13.29 |
$w/o$ | De-En | 16.96 | 17.38 | 15.28 | 13.30 | 12.85 | 12.75 | 49.22 | 14.98 | 13.13 | 12.70 | 12.56 |
$\mu(\delta)$ | | 0.18 | −0.62 | 1.08 | 1.27 | 0.49 | 0.78 | 0.36 | 1.06 | 1.26 | 0.57 | 0.88 |
$\sigma(\delta)$ | | 0.73 | 2.24 | 1.92 | 2.74 | 0.83 | 1.68 | 1.99 | 1.77 | 2.64 | 0.70 | 1.56 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Woo, S.; Kim, K.; Noh, J.; Shin, J.-H.; Na, S.-H.
Revisiting Dropout: Escaping Pressure for Training Neural Networks with Multiple Costs. *Electronics* **2021**, *10*, 989.
https://doi.org/10.3390/electronics10090989

**Chicago/Turabian Style**

Woo, Sangmin, Kangil Kim, Junhyug Noh, Jong-Hun Shin, and Seung-Hoon Na.
2021. "Revisiting Dropout: Escaping Pressure for Training Neural Networks with Multiple Costs" *Electronics* 10, no. 9: 989.
https://doi.org/10.3390/electronics10090989