Revisiting Dropout: Escaping Pressure for Training Neural Networks with Multiple Costs

A common approach to jointly learning multiple tasks with a shared structure is to optimize the model on a combined landscape of multiple sub-costs. However, gradients derived from each sub-cost often conflict on cost plateaus, resulting in a subpar optimum. In this work, we shed light on this gradient conflict challenge and suggest a solution named Cost-Out, which randomly drops sub-costs at each iteration. We provide theoretical and empirical evidence for the existence of an escaping pressure induced by the Cost-Out mechanism. While simple, the empirical results indicate that the proposed method can enhance performance on multi-task learning problems, including two-digit image classification sampled from the MNIST dataset and machine translation between English and French, Spanish, and German on the WMT14 datasets.


Introduction
A primary goal of multi-task learning is to obtain a versatile and generalized model by effectively learning the shared portion of multiple objectives [1,2]. The growing number of models that perform well on a single task has naturally increased interest in models that can perform multiple tasks simultaneously [3][4][5][6][7]. In computer vision, for example, object detection aims to predict bounding box localizations and their corresponding object categories at the same time [8,9]. In natural language processing, we predict multiple classes and optimize additional costs simultaneously to refine the prediction [10][11][12], which is then used in sophisticated methods such as hierarchical softmax [13].
Despite the progress of such multi-tasking models, less attention has been paid to how to properly learn multiple objectives with a unified structure. Summing multiple sub-costs with balancing hyperparameters [14][15][16] is the de facto standard for defining the total cost in multi-task learning. However, this strategy yields optimization difficulties because gradients of overlapped sub-cost landscapes often interfere with each other, resulting in a pseudo optimum of the total cost landscape. We call it a pseudo optimum because, although the total landscape is a mixture of landscapes each representing the actual optimum of a sub-cost, its optimum does not correspond to any single actual optimum. In other words, optimizing via the total cost places the optimum on a nearly flat landscape (i.e., zero gradient) far from the true optimum, since gradients drawn from each sub-cost are likely to conflict near the true optimum (see Figure 1). We thus end up with a question: is this the best optimum we can achieve?
To answer this question, we shed light on the effect of gradient conflict, especially on cost plateaus, in multi-task learning. We usually stop training when the total cost converges. However, since multi-task learning combines multiple sub-costs into a total cost, there is a high chance that plateaus are created in the process. We can thus at least presume that the true optimum is somewhere on the plateau; worse, it could lie outside the plateau and not even close to it. As reaching the plateau does not mean reaching the true optimum, we see room for improvement: we can reach an above-par result if we can resolve the gradient conflict. Motivated by the insight that conflicts between gradients lead to a pseudo optimum, we propose a method called Cost-Out, a dropout-like [17] random selection mechanism over sub-costs. The mechanism stochastically samples the sub-costs to be learned at every gradient step; from a forward-backward perspective, it backpropagates only the gradients of the selected sub-costs. Leveraging this randomness improves performance by overcoming cost plateaus, which is not feasible with conventional multi-task learning methods. In this paper, we coin the induced effect of this randomness escaping pressure. We first theoretically establish its existence and analyze its properties. Empirical results then demonstrate the effectiveness of the Cost-Out mechanism on several multi-task learning problems, including two-digit image classification (TDC-same, TDC-disjoint) and machine translation (MT-hsoftmax, MT-sum). The performance gain is especially noticeable when the regularization effect on the model is not too strong. As the mechanism only concerns how sub-costs are sampled, we assert that it is generally applicable to multi-task learning frameworks.
The contributions of this paper are threefold: (1) To the best of our knowledge, our work is the first attempt to characterize the challenges of multi-task learning in terms of conflicts between gradients of sub-costs. (2) We propose a dropout-like mechanism called Cost-Out, and theoretically confirm its effect on inducing escaping pressure out of plateau on the total cost landscape. (3) Extensive and comprehensive experiments demonstrate that the Cost-Out mechanism is effective in several multi-task learning settings.
The rest of the paper is organized as follows: Section 2 reviews related work; Section 3 analyzes the escaping pressure in detail and describes our proposed method, Cost-Out; Section 4 presents the experimental settings and results; Section 5 thoroughly discusses the results and core findings; and Section 6 concludes and outlines future work.

Related Work
The mechanism of Cost-Out is identical to dropout [17], except that the switching mechanism applies to the final layer. Since this approach does not improve performance in general cases, we introduce the problem conditions and applications in which Cost-Out can help.
The benefit of Cost-Out can be understood through Bayesian model averaging [17][18][19], a general topic in statistical modeling. Unlike averaging many ensemble models, the advantage of Cost-Out is that it automatically selects only the parameters that need to be split.
Training neural networks with multiple sub-costs is a common form of multi-task learning [20]. It generalizes neural networks by allowing parameters to serve multiple purposes, and it regularizes models by increasing the required model capacity. Such regularization typically trades off training accuracy, but the amount of training loss incurred redundantly has not been investigated in depth. The proposed method, Cost-Out, is expected to reduce this unnecessary inefficiency in the regularization of multi-task learning.

Performance Limit Caused by Multiple Sub-Costs
In neural network training, adding sub-costs to the total cost often limits accuracy [21]. This phenomenon can be easily observed by comparing the accuracy of simultaneously predicting two identical examples (data samples satisfying the independent and identically distributed (i.i.d.) assumption) with the accuracy of predicting each example separately.
In preliminary experiments, we train a multi-layer perceptron (MLP) [22] on the MNIST training set for digit image classification (http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz, accessed on 21 April 2021). The network follows the state-of-the-art MLP for MNIST [22], using 512 hidden nodes, hyperbolic tangent (tanh) activation, the stochastic gradient descent (SGD) optimizer, and 10^{-3} L_2-regularization. We then copy one image of size 28 × 28 to create two identical images, concatenate them into a single image of size 28 × 56, and train the MLP to predict two digits. The results of the preliminary experiments are shown in Table 1. Concatenating two images and predicting only one class (dual-single) decreases test accuracy. This is natural because the network cannot clearly distinguish which input dimensions are responsible for which classes, so single-digit predictions are easily interfered with by the other prediction. More importantly, when the network is trained to predict the two digit classes at the same time (dual-dual), performance decreases again. One might attribute this to limited model capacity, but the abstract features required for both digits are exactly the same, so we can presume that the network has enough capacity to perform at least as well as a single-task model. This argument is supported by the preliminary finding that even increasing the number of hidden nodes does not restore performance.
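The dual-dual setup above can be sketched as follows. This is a minimal illustration, not the original implementation: synthetic arrays stand in for MNIST images, the training loop is omitted, and only the forward pass with the summed two-head cross-entropy is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for MNIST images (28x28) and labels (0-9).
images = rng.random((100, 28, 28)).astype(np.float32)
labels = rng.integers(0, 10, size=100)

# "dual-dual": copy each image, concatenate to 28x56, predict both digits.
dual = np.concatenate([images, images], axis=2)      # (100, 28, 56)
x = dual.reshape(100, -1)                            # (100, 1568)

# One hidden layer with tanh activation, as in the preliminary setup.
W1 = rng.normal(0, 0.05, (x.shape[1], 512))
W2 = rng.normal(0, 0.05, (512, 20))                  # two 10-way heads

h = np.tanh(x @ W1)
o = h @ W2
o1, o2 = o[:, :10], o[:, 10:]                        # per-digit sub-task outputs

def softmax_ce(logits, y):
    """Mean softmax cross-entropy of integer labels y against logits."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(y)), y].mean()

# Total cost = sum of the two sub-costs (the standard multi-task recipe).
total_cost = softmax_ce(o1, labels) + softmax_ce(o2, labels)
```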

Gradient Conflict between Sub-Costs
We posit gradient conflict between sub-costs as a cause of the accuracy limitation when learning the additive total cost. As shown in Figure 1-top, in a simple maximization problem, summing two sub-costs with distinct optima yields a total-cost optimum whose value is worse than either sub-cost's own optimum. This degradation is due to the gradient conflict shown in Figure 1-bottom, where gradients cancel each other and produce a zero-gradient plateau at a subpar optimum. Splitting the network into a completely separate model for each sub-cost is undesirable because it loses the benefits of training with multiple sub-costs. The purpose of Cost-Out is to keep the advantages of multi-task learning while reducing the accuracy limit.

Method: Stochastic Switching of Sub-Costs
Unlike a typical multi-task learning scheme (see Figure 2-left), Cost-Out stochastically excludes a subset of sub-costs at each parameter update during training, as illustrated in Figure 2-right. In this paper, we describe two variants that incorporate the "dropout" mechanism: soft Cost-Out (sCO) and hard Cost-Out (hCO). Soft Cost-Out is identical to dropout applied to the final layer: each sub-cost is dropped with a given probability p. However, adopting the original dropout mechanism for cost dropping is not always sufficient to cause escaping pressure, since there is a chance that all sub-costs are selected for an update, in which case the model again moves toward the optimum of the total cost. We therefore also adopt hard Cost-Out, which updates only one sub-cost at a time.
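The two variants can be sketched as sub-cost masks. This is a hedged illustration of the sampling step only; the mask would multiply each sub-cost's contribution to the update, and the function names are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_costout_mask(n_costs, p, rng):
    """Soft Cost-Out: drop each sub-cost independently with probability p.
    All sub-costs may survive, so some updates still follow the total cost."""
    return (rng.random(n_costs) >= p).astype(float)

def hard_costout_mask(n_costs, rng):
    """Hard Cost-Out: exactly one sub-cost is updated per step."""
    mask = np.zeros(n_costs)
    mask[rng.integers(n_costs)] = 1.0
    return mask

sub_costs = np.array([0.7, 1.2, 0.4])   # hypothetical per-task losses
s = soft_costout_mask(3, p=0.5, rng=rng)
h = hard_costout_mask(3, rng)
soft_cost = (sub_costs * s).sum()        # a random subset of sub-costs
hard_cost = (sub_costs * h).sum()        # exactly one sub-cost
```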

Estimation of Escaping Pressure
Cost-Out derives the gradients using only the sampled sub-costs at every update. Iterating updates yields a series of sub-cost gradients, and examining their directions reveals two broad cases. In the first case, successive gradients point in similar directions, which implies that the local optima of the sub-costs lie in roughly the same location; here, repeating the probabilistic selection simply helps the movement toward the optimum. In the second case, successive gradients point in opposite directions, producing conflicts between sub-costs near the local optimum of the total cost. In this latter case, Cost-Out introduces additional non-zero gradients and thereby a drift effect. We show the existence of this drift and estimate its magnitude; the resulting expected total gradient is plotted in Figure 1-bottom for comparison with the original gradient of the combined cost.
The simple case can be generalized to complex cost functions by extending the results to a set of selected sub-costs C_i and its complement C_i^c. Starting from the gradient ∇_{θ_1} C_i at θ_1 and the gradient τ_{C_i} obtained after an update with the selected C_i, we can derive the expected gradient τ over all possible combinations of C_i whose sub-costs are selected by a Bernoulli distribution with parameter p.
The pressure δ_{θ_0} can be simplified accordingly. As a result of this derivation, the escaping pressure is determined by p(1 − p) and by the products of Hessian diagonals and gradients over all combinations of two different sub-costs.

Convergence of Cost-Out Compared to Other Optimization Methods
To see the effect of the escaping pressure on optimization, we show a convergence simulation of SGD with the estimated escaping pressure in Figure 3. For comparison, we also plot simulations of two popular momentum-based optimizers: Adam [23] and AdaDelta [24]. In the landscape, there are three optima: one at the center x = 0 (o_c) and two at the left and right boundaries of the conflict region (o_l and o_r, respectively). If the sub-cost selection probability is even, Cost-Out forces a gradient-based stochastic optimizer toward the two candidate optima o_l and o_r with equal chance. This causes stochastic perturbation, since small movements within the conflict region diverge toward its boundaries, as summarized in Table 2. Under the escaping pressure, the model parameters are updated repeatedly until they settle at one of the two optima o_l and o_r. This phenomenon is observable only when the optimization procedure is affected by the escaping pressure rather than by momentum (e.g., Adam, AdaDelta); momentum-based optimizers are designed to correctly find the optimum of the total cost rather than to change the cost landscape. While both dropout and Cost-Out induce escaping pressure, only Cost-Out reduces the gradient conflict between sub-costs, since the original dropout applied to internal layers only decreases the total cost via parameter updates.
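The conflict scenario can be reproduced with a toy 1D simulation, assuming two quadratic sub-costs with optima at −1 and +1 standing in for o_l and o_r (our choice of landscape, not the paper's exact one). Plain SGD on the summed cost would sit at the center o_c where the gradients cancel, while hard Cost-Out keeps perturbing the iterate away from it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two conflicting 1D sub-costs: optima at -1 (o_l) and +1 (o_r).
# Their sum is minimized at theta = 0 (o_c), where the gradients cancel.
grad = [lambda t: 2 * (t + 1),   # d/dt of (t + 1)^2
        lambda t: 2 * (t - 1)]   # d/dt of (t - 1)^2

lr, theta, traj = 0.1, 0.0, []
for _ in range(5000):
    g = grad[rng.integers(2)](theta)   # hard Cost-Out: one sub-cost per step
    theta -= lr * g
    traj.append(theta)

# SGD on the summed cost stays at 0; Cost-Out keeps the iterate displaced.
mean_abs = float(np.mean(np.abs(traj[1000:])))
```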

Experiments and Results
Cost-Out is a generic yet straightforward mechanism that drops gradients of partial tasks rather than simultaneously learning the entire gradients of every task. To verify the effectiveness of Cost-Out, we adopt two representative tasks in the field of vision and language-image classification and machine translation.

Classifying Two-Digit Images Sampled from the Same Set (TDC-Same)
The goal of this problem is to predict two digits with a neural network from the concatenation of two 28 × 28 input images from the MNIST dataset [25]. Compared to separately classifying each digit from its corresponding image, this problem is more complex because the features of the two images interact within a single network. The cost function is the expectation over D of the sum of two cross-entropy sub-costs:

f_TDC = E_{(o_1, o_2, y_1, y_2) ∈ D} [ − Σ_{j=1}^{d/2} (y_1)_j log (o_1)_j − Σ_{j=1}^{d/2} (y_2)_j log (o_2)_j ],

where d is the length of the output vector o, the partial output vectors o_1 and o_2 are generated from the two input images, and y_1 and y_2 are one-hot vectors indicating the correct digit label indices. The data set D is composed of the inputs generating o_1 and o_2 and the corresponding y_1 and y_2.
We use the pre-split 50,000 training, 10,000 validation, and 10,000 test sets of the publicly released setting (http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz, accessed on 21 April 2021). Samples in each set are copied once, randomly shuffled, and then concatenated with the original set. The final data have the same sample size as the MNIST data, but each sample has twice the input dimension and output class size.
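The pairing procedure can be sketched as follows, with synthetic arrays standing in for the MNIST split (a sketch of the described copy-shuffle-concatenate step, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for one pre-split MNIST subset: images and labels.
images = rng.random((1000, 28, 28)).astype(np.float32)
labels = rng.integers(0, 10, size=1000)

# Copy the set once, shuffle the copy, then concatenate with the original:
# each sample pairs the original digit with a randomly drawn partner.
perm = rng.permutation(len(images))
pairs = np.concatenate([images, images[perm]], axis=2)   # (1000, 28, 56)
pair_labels = np.stack([labels, labels[perm]], axis=1)   # (1000, 2)
```

The sample count is unchanged, while input width and label size double, matching the description above.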
The impact of Cost-Out is likely to be affected by other regularization techniques, since it also has a regularization effect. For example, if regularization is too strong relative to the given model capacity, Cost-Out may decrease performance. To evaluate Cost-Out under this regularization-sensitive condition, we select combinations of typical regularization methods: L_2 penalization, batch normalization (BN) [26], dropout, and model-size changes. The detailed combinations are shown in Table 3, which lists the hyper-parameter settings for the TDC-same task. BN is decayed by gradually reducing the interpolation rate between normalized and original activations. The optimal number of hidden nodes reported in the original MNIST challenge is near 800 [27], so we test one model smaller and one model larger than the reported one. Model parameters are initialized by randomly selecting a real value in [−1/√n, 1/√n], where n is the number of parameters of a layer [28].

Performance Recovery under Various Regularization Effects
We plot the maximum performance of TDC-same in Figure 4. This result is collected from various combinations of regularization methods and hyper-parameters. The optimal L_2 scale is 10^{-5} for all settings. The ensemble shows the best performance among all regularization settings. The TDC-same problem gains nothing from resource sharing, so only the negative effect of gradient conflict occurs; the ensemble completely removes the conflict issue, and its result is therefore the upper bound on the performance of multi-task learning. Both Cost-Out versions improve performance compared to the ordinary case and even the case using dropout, which implies that the model averaging and regularization effects of dropout are orthogonal to sub-cost switching. Detailed numerical results are shown in Table 4. The ensemble result is higher than the best case without Cost-Out by 0.12%. Applying Cost-Out improves performance over the best result without Cost-Out by 0.04% precision, which recovers 29% of the performance decrease caused by using multiple sub-costs. Dropout, in contrast, seriously decreases the best performance at large L_2 scales, and its maximum is lower than that of the Cost-Out methods. This evaluation confirms that applying Cost-Out can recover the performance lost by using multiple sub-costs. More detailed performance changes are plotted in Table 5.
With stochastic gradient descent, the performance gain from Cost-Out is much more significant than with the standard model. Regularization does not entirely explain this gain, because the effect is consistently observed under various regularization strengths (L_2 penalty scales and dropout). With Adam, the gain is smaller, yet performance still differs from the typical case, which implies the effect of escaping pressure. When batch normalization is applied, the gain in the SGD case almost disappears. Overall, there is evidence that escaping pressure affects performance, but it can easily be hidden by batch normalization or by the optimizer.

Relaxation of Gradient Conflict
To investigate the impact of Cost-Out on optimization, we examine the mean and maximum absolute values of all gradient elements with respect to the total cost, which we call the gradient scale. This metric represents the steepness near the optima in the convex region, rather than the gradient values for various sub-cost combinations.
In Figure 5, we see that Cost-Out substantially increases the gradient scale, which supports the claim that adopting Cost-Out causes escaping pressure. When Cost-Out is applied, the gradient scale with respect to the total cost is the sum of the values in the convex landscapes of its sub-costs and is therefore unaffected by gradient canceling. Applying Cost-Out can thus increase the gradient scale by the canceled amount, consistent with the result.
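The gradient-scale metric itself is straightforward; the sketch below uses two hypothetical sub-cost gradients (illustrative numbers, not measured values) to show how cancellation in the summed gradient lowers the metric:

```python
import numpy as np

def gradient_scale(grads):
    """Mean and max absolute value over all gradient elements, as in the
    gradient-scale evaluation with respect to the total cost."""
    flat = np.concatenate([np.abs(g).ravel() for g in grads])
    return float(flat.mean()), float(flat.max())

# Two sub-cost gradients whose first components nearly cancel.
g1 = np.array([0.8, -0.3])
g2 = np.array([-0.7, -0.2])

mean_sum, max_sum = gradient_scale([g1 + g2])   # summed-cost gradient
mean_sep, max_sep = gradient_scale([g1, g2])    # no cancellation
```

Here the summed gradient shows a smaller scale than the separate sub-cost gradients, mirroring the canceled amount described above.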

Classifying Two-Digit Images Sampled from the Two Disjoint Sets (TDC-Disjoint)
In the TDC-same task, the parameters of all layers except the final fully-connected layer are shared across all sub-costs. Therefore, the optima for the two sub-costs using the copied data may be similar, even if not identical, owing to random initialization of the final layer. To prepare a more practical environment that generates different optima for the sub-costs, we define a new problem, TDC-disjoint, using two disjoint image sets of MNIST digits 0 to 4 and 5 to 9. The cost is calculated as in Equation (10), but the data set D is composed of concatenated vectors from the two disjoint sets. Training, validation, and test sample sizes are half of those of the TDC-same task. Practical networks are usually very deep, so we focus on evaluating how the effect changes as depth increases. In this experiment, we use batch normalization with decaying, the Adam optimizer with learning rate 1 × 10^{-4}, 2000 hidden nodes per layer, tanh activation, hard Cost-Out, and no penalization, varying the number of layers from 1 to 10.

Performance Change by Deep Structuring
The impact of the escaping pressure is amplified as the number of layers increases. Figure 6 shows the best precision on training data averaged over five runs. There is no difference between using and not using Cost-Out for the first to fifth layers, but Cost-Out shows better results from the sixth layer onward. However, neither model found the optimum at depths of 9 to 10.

Machine Translation with Hierarchical Softmax (MT-Hsoftmax)
For the machine translation tasks with hierarchical softmax, we combine two data sets, Europarl-v7 and CommonCrawl, provided by WMT 2014 (http://www.statmt.org/wmt14/, accessed on 21 April 2021). Tokenizing, lower-casing, and cutting off at 40 tokens are applied using the tools provided by MOSES [29] (http://www.statmt.org/moses/, accessed on 21 April 2021). We set up six tasks: translating French, Spanish, and German to English and vice versa. Each training set has 1.5 million sentence pairs, of which 10% is used as a validation set. The test sets are the Newstest set of 3000 sentences and the News-Commentary set of 150,000 sentences. To provide sufficient model capacity, we use four stacks of 1000 LSTM cells for the encoder and the same size for the decoder. Word vectors are explicitly trained by word2vec [30] and imported in the training and translation phases (https://code.google.com/archive/p/word2vec/, accessed on 21 April 2021). Detailed model parameters are shown in Table 6. We use a bidirectional recurrent neural network with global attention [31,32]. BN is applied to all net values for the gate and cell vectors; the normalization weight is decayed and converges to near 0 after about 20 epochs. Cost-Out is applied only in the training phase. We use the Adam optimizer, which showed better results than SGD and AdaDelta in preliminary experiments.
To create a simple combined cost, we design a k-expansion softmax function in which the output vector o is split into K segments of length k and each segment is scored against one digit of the k-expansion y_k of the correct class index y. Here, y is a one-hot vector indicating the correct class index, and the constant K is the number of sub-tasks, equal to the number of base-k digits needed to cover the vocabulary. The vector (o)_i^j denotes the segment of o from the i-th to the (j − 1)-th element. Thus, each segment of the output vector represents the probability of selecting the correct index at each position in the k-expansion of y. In our experiments, we set k to 1024 and the length K to 2 to cover a vocabulary size of more than one million.
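A k-expansion softmax loss can be sketched as below, under our reading of the definition: the class index is expanded into K base-k digits, and each k-wide output segment is scored with an ordinary softmax cross-entropy against its digit. Function names and details are ours, not the authors'.

```python
import numpy as np

def k_expansion(y, k, K):
    """Base-k digits of class index y, most significant first (length K)."""
    digits = []
    for _ in range(K):
        digits.append(y % k)
        y //= k
    return digits[::-1]

def k_expansion_ce(o, y, k, K):
    """Sum of K cross-entropies, one per k-wide segment of the output o."""
    total = 0.0
    for m, d in enumerate(k_expansion(y, k, K)):
        seg = o[m * k:(m + 1) * k]
        z = seg - seg.max()
        logp = z - np.log(np.exp(z).sum())
        total -= logp[d]
    return total

k, K = 1024, 2                 # 1024**2 segments cover > 1e6 classes
o = np.zeros(k * K)            # hypothetical (uniform) output vector
loss = k_expansion_ce(o, y=123456, k=k, K=K)
```

With a uniform output, each segment contributes log(k), so the loss equals K·log(k).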

Effect of Cost-Out in MT-hsoftmax
In the MT-hsoftmax task, we validate the benefits of Cost-Out by predicting the correct word and position in the target sentences, measured with mean precision and BLEU. The results are shown in Table 7. Since achievable translation quality varies widely across translation tasks, we measure the performance change from applying Cost-Out. The performance gains (δ) in mean precision and BLEU are positive in almost all cases, implying that Cost-Out improves translation quality. While dropout is not recommended in NMT because of its high sensitivity [33], Cost-Out can improve performance without largely destroying the trained internal information.

Machine Translation Summing Costs of All Target Words (MT-Sum)
In neural machine translation, summing the cross-entropy values for classifying each word in the target sentence is a common approach, and it can also be regarded as multi-task learning. To evaluate the impact of the escaping pressure, we keep the environment the same as in the MT-hsoftmax configuration, except that the hierarchical softmax is not used. The cost function is the sum over the target sentence of per-token cross-entropies:

f_MT-sum = E_D [ − Σ_{t=1}^{L} (y_t)^T log softmax(o_t) ],

where L is the length in tokens of a target sentence. Hard Cost-Out is applied by randomly selecting half of the words in a sentence to update at each step.
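Hard Cost-Out over per-token losses can be sketched as a binary mask (a minimal sketch with hypothetical loss values; in training, only the masked-in terms would be backpropagated):

```python
import numpy as np

rng = np.random.default_rng(0)

def costout_token_mask(L, rng):
    """Randomly select half of the target-sentence tokens per update;
    only their cross-entropy terms contribute to the gradient."""
    mask = np.zeros(L)
    on = rng.choice(L, size=L // 2, replace=False)
    mask[on] = 1.0
    return mask

token_losses = rng.random(12)        # hypothetical per-token CE values
mask = costout_token_mask(len(token_losses), rng)
masked_cost = (token_losses * mask).sum()
```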

Effect of Cost-Out in MT-Sum
The results of the MT-sum task are shown in Table 8. As with the previous results, the effectiveness of Cost-Out is visible when comparing results with and without it. Although some results show rather poor performance, the performance gain is generally positive. Through the two experiments, MT-hsoftmax and MT-sum, we empirically demonstrate that applying Cost-Out when learning multiple sub-costs helps find a better optimum regardless of the sub-cost setting.

Discussion
We believe the main reason for the improved test performance across the experiments is the escaping pressure induced by the Cost-Out mechanism (see Tables 4, 7 and 8). Another possible reason for the improvement could be the other regularization methods. To find settings with as little interference from regularization as possible, and to verify the consistent effectiveness of Cost-Out across environments (i.e., different hyperparameter settings), we perform a grid search over batch normalization, dropout, L_2-regularization, model capacity (i.e., the dimension of the hidden layer), and gradient-based optimization methods (e.g., SGD, Adam) (see Figure 4, Table 5). Although combining Cost-Out with regularization can affect performance, Cost-Out achieved the best results under most settings, implying that it generates effects orthogonal to the other regularization methods; we attribute this to escaping pressure. Moreover, the plateau in the cost landscape, where escaping pressure can improve performance, is also observed in the gradient-scale evaluation (see Figure 5). The change in gradient scale after applying Cost-Out shows that the model converges to an optimum whose surrounding area is steeper. This observation supports that (1) summing training sub-costs for multi-task learning can flatten optima, and (2) Cost-Out causes escaping pressure that moves the converging point to the boundary of the flattened area.
This escaping effect helps training in deep structures. As shown in Figure 6, stacking neural network layers gradually decreases training accuracy through gradient vanishing and, probably, landscape flattening from the additional model parameters. Applying Cost-Out improves precision where it suffers from these negative effects of scaling up the model. We conduct experiments in two domains, vision and language, to show that the proposed methodology is domain- and structure-agnostic. We believe it is not limited to these two tasks and can be applied to more diverse multi-task learning problems, since the approach applies whenever the total cost is defined as a summation of sub-costs.

Conclusions and Future Work
In this paper, we address the gradient conflict problem that arises during multi-task learning of neural networks. We postulate that this common problem in optimizing parameterized models can be mitigated by the escaping pressure induced by Cost-Out, a dropout-like random selection mechanism over sub-costs. In the experiments, we empirically confirm the existence of escaping pressure, which automatically selects the gradients responsible for each task and pushes the model toward the optima of the individual sub-costs, and we demonstrate its impact on two-digit image classification and machine translation. Finally, we observe from the results that deep and insufficiently regularized models improve in performance when using Cost-Out.
This work can be extended to analyze the benefits of mini-batch-based training, since randomly selecting a mini-batch for each update closely parallels selecting a sub-cost for each update.