Cross Entropy in Deep Learning of Classifiers Is Unnecessary—ISBE Error Is All You Need

In deep learning of classifiers, the cost function usually takes the form of a combination of SoftMax and CrossEntropy functions. The SoftMax unit transforms the scores predicted by the model network into assessments of the degree (probabilities) of an object's membership in a given class, while CrossEntropy measures the divergence of this prediction from the distribution of target scores. This work introduces the ISBE functionality and justifies the thesis that cross-entropy computation is redundant in deep learning of classifiers. Not only can the calculation of entropy be omitted, but during backpropagation there is also no need to route the error through the normalization unit for its backward transformation; instead, the error is sent directly to the model's network. Using perceptron and convolutional networks as classifiers of images from the MNIST collection, it is observed that ISBE does not degrade results, not only with SoftMax but also with other activation functions such as Sigmoid, Tanh, or their hard variants HardSigmoid and HardTanh. Moreover, savings in the total number of operations are observed within the forward and backward stages. The article is addressed to all deep learning enthusiasts, but primarily to programmers and students interested in the design of deep models. For example, it illustrates in code snippets possible ways to implement the ISBE functionality, and it also formally proves that the softmax trick applies only to the class of dilated SoftMax functions with relocations.


Introduction
A deep model is a kind of mental shortcut ([1]), broadly understood as a model created by deep learning of a certain artificial neural network N, designed for a given application. What, then, is an artificial neural network ([2]), what is its deep learning ([3,4]), and what applications ([5]) are we interested in?
From a programmer's perspective, an artificial neural network is a type of data processing algorithm ( [6]), in which subsequent steps are carried out by configurable computational units, and the order of processing steps is determined by a directed graph of connections without loops.
At the training stage, each group of input data X, i.e., each batch of training examples, first undergoes the inference (forward) phase on the current model, i.e., processing through the network N at its current parameters W. As a result, network outputs Y ← F_N(X; W) are produced ([7]).

After the inference phase comes the model update phase, where the current model is modified (improved) according to the selected optimization procedure ([8]). The model update phase begins with calculating the loss (cost) value Z ← L(Y, Y•), defined by the chosen loss function L from the inference outcome Y and the target result Y•. The loss Z depends (indirectly, through Y) on all parameters W, and what conditions the next step of the update phase is the determination of the sensitivity of the loss function L to their changes. The mathematical model of this sensitivity is the gradient ∇W ≐ ∂L/∂W. Knowing this gradient, the optimizer makes the actual modification of W in a direction that also takes into account the values of gradients obtained for previous training batches.
Calculating the gradient with respect to parameters actually assigned to different computational units required the development of an efficient algorithm for its propagation in the opposite direction to inference ( [9,10]).
Just as in the inference phase each unit U has its formula Y_u ← F_U(X_u, W_u) for processing data from input X_u to output Y_u with parameters W_u, so in the backward gradient propagation phase it must have a formula (∇X_u, ∇W_u) ← F̄_U(∇Y_u) for transforming the gradients assigned to its outputs Y_u into gradients assigned to its inputs X_u and its parameters W_u. Based on such local rules of gradient backpropagation and the created computation graph, the backpropagation algorithm can determine the gradients of the cost function with respect to each parameter in the network. The computation graph is created during the inference phase and is essentially a stack of links between the arguments and results of calculations performed in successive units ([10,11]).
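Such a local backward rule can be sketched in a few lines of PyTorch (an illustrative example, not the paper's code): a linear unit whose forward step is Y_u = X_u W_u and whose backward step maps the output gradient into input and parameter gradients.

```python
import torch

# A unit with an explicit local backward rule: forward Y = X @ W,
# backward maps the output gradient dY into dX = dY @ W^T, dW = X^T @ dY.
class LinearUnit(torch.autograd.Function):
    @staticmethod
    def forward(ctx, X, W):
        ctx.save_for_backward(X, W)  # link arguments on the computation graph
        return X @ W

    @staticmethod
    def backward(ctx, dY):
        X, W = ctx.saved_tensors
        return dY @ W.t(), X.t() @ dY  # dX, dW

X = torch.randn(5, 3, requires_grad=True)
W = torch.randn(3, 2, requires_grad=True)
LinearUnit.apply(X, W).sum().backward()  # fills X.grad and W.grad
```

The saved tensors in `ctx` play exactly the role of the "stack of links" between arguments and results mentioned above.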
Deep learning is precisely a concert of these inference and update phases in the form of gradient propagation, calculated for randomly created groups of training examples. These phases, intertwined, operate on multidimensional deep tensors (arrays) of data, processed with respect to network inputs, and on deep tensors of gradient data, computed with respect to losses determined for the output data of the trained network.
Here, by a deep tensor, we mean a multidimensional data array that has many feature maps, i.e., its size along the feature axis is relatively large, e.g., 500, which means 500 scalar feature maps. We then say that at this point in the network our data has a deep representation in a 500-dimensional space.
As for the applications of interest in this work, these are the ones that involve at least one classification requirement ([12]). An example could be crop detection from satellite images ([13]) or building segmentation in aerial photos ([14]), but also text translation ([15]). Classification is also related to voice command recognition ([16]), speaker recognition ([17]), segmentation of the audio track according to speakers ([18]), recognition of speaker emotions with visual support ([19]), and classification of objects of interest along with their localization in the image ([20]).
It may be risky to say that after 2015, in all the aforementioned deep learning classifiers, the cost function takes the form of a composition of the SoftMax function ([21]) and the CrossEntropy function, i.e., cross-entropy ([22]). The SoftMax unit normalizes the scores predicted by the classifier model for the input object into soft scores that sum to one, which can be treated as an estimate of the conditional probability distribution of classes. Meanwhile, cross-entropy measures the divergence (discrepancy) of this estimate from the target probability distribution (class scores). In practice, the target score may be taken from a training set prepared manually by a so-called teacher ([23]) or may be calculated automatically by another model component, e.g., in the knowledge distillation technique ([24]).
For K classes and n_b training examples, the SoftMax function is defined for the raw score matrix X ∈ R^{n_b×K} as

Y_{bk} = SoftMax(X)_{bk} = exp(X_{bk}) / Σ_{j∈[K]} exp(X_{bj}), b ∈ [n_b], k ∈ [K],

where the notation [K] denotes any K-element set of indices, in this case class labels.
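The definition above can be checked numerically; a minimal row-wise implementation (with the usual shift by the row maximum for stability) agrees with the library function:

```python
import torch

# Row-wise SoftMax over raw scores X in R^{n_b x K};
# every row of the result sums to one.
def softmax_rows(X):
    E = torch.exp(X - X.max(dim=1, keepdim=True).values)  # shift for stability
    return E / E.sum(dim=1, keepdim=True)

X = torch.randn(4, 10)
Y = softmax_rows(X)
```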
The CrossEntropy function on the matrices Y, Y• ∈ R^{n_b×K} is defined by the formula

CE(Y, Y•) = −Σ_{b∈[n_b]} Σ_{k∈[K]} Y•_{bk} log Y_{bk}.

Classifier loss function: separated implementation
When classifiers began using a separated implementation of the combination of the normalization unit SoftMax and the unit CrossEntropy, it quickly became evident in practice that this implementation had problems with scores close to zero, both in the inference phase and in the backward propagation of its gradient. Only the integration of CrossEntropy with the SoftMax normalization eliminated these inconveniences. The integrated approach has the following form:

CE_all(X, Y•) ≐ CE(SoftMax(X), Y•) = Σ_{b∈[n_b]} ( log Σ_{j∈[K]} exp(X_{bj}) − Σ_{k∈[K]} Y•_{bk} X_{bk} ),

valid when each row of Y• sums to one. This redundancy in notation was helpful in deriving the equation of gradient backpropagation for the integrated loss function CrossEntropy ∘ SoftMax ([25]).

The structure of this paper is as follows:
1. In the second section, titled ISBE Functionality, we analyze the conditions that a normalization unit must meet for its combination with a cross-entropy unit to yield a gradient at the input equal to the difference of soft scores: ∇X = Y − Y•. Then the definition of the ISBE functionality is introduced, which in the inference phase (I) normalizes the raw score to a soft score (S), and in the backward propagation phase (B) returns an error (E) equal to the difference of soft scores. It is also justified why, in the case of the SoftMax normalization function, the ISBE unit has, from the perspective of the learning process, the functionality of the integrated unit CrossEntropy ∘ SoftMax.
2. In the third section, using the example of the problem of recognizing handwritten digits and the standard MNIST(60K) image collection ([26]), numerous experiments show that, in addition to the obvious savings in computational resources, for five activations serving as normalization functions the classifier's effectiveness is not lower than that of the combination of the normalization unit SoftMax and the CrossEntropy unit. This ISBE property was verified for the activation units SoftMax, Sigmoid, HardSigmoid, Tanh, and HardTanh.
3. The fourth and final section contains conclusions.
4. Appendix A, titled Cross-Entropy and Softmax Trick Properties, contains the formulation and proof of the theorem on the properties of the softmax trick.
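The numerical problem with the separated implementation mentioned above can be reproduced in a few lines of PyTorch: when one raw score dominates, the remaining soft scores underflow to zero, the separated log then yields −inf, while the integrated log-softmax stays finite.

```python
import torch

x = torch.tensor([[1000.0, 0.0, -1000.0]])

# separated: softmax first, log afterwards; soft scores underflow to 0
separated = torch.log(torch.softmax(x, dim=1))   # contains -inf entries

# integrated: log and softmax fused into one numerically stable unit
integrated = torch.log_softmax(x, dim=1)         # all entries finite
```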

ISBE Functionality
The ISBE functionality is a proposed simplification of the cost function combining the SoftMax normalization function with the cross-entropy function, hereafter abbreviated as CE_all. Its role is to penalize those computed probability distributions that differ significantly from the score distributions proposed by the teacher.
To understand this idea, let us extend the inference diagram for CE_all with the backward propagation part for the gradient. We consider this diagram in its separated version, omitting the earlier descriptions for diagram (1). The variables X, Y, Y•, Z and ∇Z, ∇Y, ∇X appearing in diagram (2) have the following meaning: X is the raw score at the input of the normalization function preceding cross-entropy CE, Y is the soft score, Y• the target score, Z the loss value, and ∇Z, ∇Y, ∇X are the gradients assigned to them. The key formula here is ∇X ← (Y − Y•). Its validity comes from the mentioned theorem and formula (6), associated with the softmax trick property.
Equation (5), on the form of the Jacobian of the normalization unit, is both a sufficient and a necessary condition for its combination with the cross-entropy unit to ensure the equality (6). Moreover, this condition implies that an activation function with a Jacobian of the softmax type is a SoftMax function with optional relocation.
Theorem 1 leads us to a seemingly pessimistic conclusion: it is not possible to seek further improvements by changing the activation and at the same time expect the softmax trick property to hold. Thus, the question arises: what will happen if, along with changing the activation unit, we change the cross-entropy unit to another, or even omit it entirely?
In the ISBE approach, the aforementioned simplification of the CE_all cost function involves precisely omitting the cross-entropy operation in the inference stage and, in practice, omitting all backward operations for this cost function. So what remains? The answer is also an opportunity to decode the acronym ISBE once more:
1. In the inference phase (I), we normalize the raw score X to Y = SoftMax(X), characterized as a soft score (S).
2. In the backward propagation phase (B), we return an error (E) equal to the difference between the calculated soft score and the target score, i.e., ∇X ← Y − Y•.
Why can we do this and still claim that, in the case of the SoftMax activation function, the value of the gradient transmitted to the network is identical? The answer comes directly from the property ∇X_{CE_all} = Y − Y•, formulated in equation (6) and defined in Theorem 1 as the softmax trick property.
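For completeness, the computation behind this property can be written out. With y = SoftMax(x) and a target y• whose components sum to one, the standard derivation reads:

```latex
\frac{\partial \log y_k}{\partial x_i} = \delta_{ki} - y_i
\quad\Longrightarrow\quad
\frac{\partial}{\partial x_i}\Bigl(-\sum_{k} y^{\bullet}_{k}\log y_k\Bigr)
 = -\sum_{k} y^{\bullet}_{k}\,(\delta_{ki} - y_i)
 = y_i \underbrace{\sum_{k} y^{\bullet}_{k}}_{=\,1} - \, y^{\bullet}_{i}
 = y_i - y^{\bullet}_{i}.
```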
We thus have on the left the following diagram of data and gradient backpropagation through such a unit. On the right, we have its generalization to a ScoreNormalization unit instead of the SoftMax unit.
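The diagram can be turned into code directly. Below is a minimal sketch of the ISBE unit as a custom autograd function: the forward step (I) normalizes the raw score to a soft score (S), and the backward step (B) ignores the incoming gradient and sends the error (E), the difference Y − Y•, straight to the network.

```python
import torch
import torch.nn.functional as F

class ISBE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, target):
        y = torch.softmax(x, dim=1)        # (I): normalize to soft score (S)
        ctx.save_for_backward(y, target)
        return y

    @staticmethod
    def backward(ctx, grad_output):
        y, target = ctx.saved_tensors
        return y - target, None            # (B): error (E) = Y - Y*, no grad for target

# Triggering backward injects Y - Y* at the raw-score input, exactly as
# CrossEntropy composed with SoftMax (summed over the batch) would.
x = torch.randn(4, 10, requires_grad=True)
target = F.one_hot(torch.tensor([1, 2, 3, 4]), 10).float()
ISBE.apply(x, target).backward(torch.ones_like(x))
```

The gradient stored in `x.grad` coincides with the gradient of the integrated CrossEntropy ∘ SoftMax loss, which is the content of the softmax trick property.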
Which activation functions should we reach for in order to test them in the ISBE technique?
1. The SoftMax activation function should be the first candidate for comparison, as it theoretically guarantees behavior comparable to the system containing cross-entropy.
2. Activations should be monotonic, so that the largest value of the raw score remains the largest score in the soft score sequence.
4. The activation function should not map two close scores to distant scores. For example, normalizing a vector of scores by projecting it onto the unit sphere in the p-th Minkowski norm meets all of the above conditions, yet it is not stable around zero. The normalization x ↦ x/∥x∥_p maps, for example, two points ϵ, −ϵ, distant by 2∥ϵ∥_p, to points distant exactly by 2, thus scaling their distance by a factor of 1/∥ϵ∥_p, e.g., a million times when ∥ϵ∥_p = 10^{−6}. This operation is known in the PyTorch library as the normalize function.
The experiments conducted confirm the validity of the above recommendations. The PyTorch library functions softmax, sigmoid, tanh, hardsigmoid, and hardtanh meet the above conditions and provide effective classification at a level higher than 99.5%, comparable to CrossEntropy ∘ SoftMax. In contrast, the function normalize gave results over 10% worse, on the same MNIST(60K) collection and with the same architectures.
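The instability of normalize around zero, described above, is easy to demonstrate: two inputs 2ϵ apart are mapped to points exactly distance 2 apart.

```python
import torch
import torch.nn.functional as F

eps = 1e-6
a = torch.tensor([[ eps, 0.0]])
b = torch.tensor([[-eps, 0.0]])

d_in  = torch.dist(a, b).item()   # inputs are only 2e-6 apart
d_out = torch.dist(F.normalize(a, p=2, dim=1),
                   F.normalize(b, p=2, dim=1)).item()  # projections are 2 apart
```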
What connects these good normalization functions F: R^K → R^K, of which two are not even fully differentiable? Certainly, it is the Lipschitz condition, satisfied in a certain neighborhood of zero ([27]):

∥F(x) − F(x′)∥ ≤ c · ∥x − x′∥ for all x, x′ in that neighborhood of zero.

Note that the Lipschitz condition meets the expectations of the fourth requirement on the above list of recommendations for ISBE. Moreover, we do not expect the constant c to be less than one, i.e., we do not require the function F to be a contraction.
We also need a recommendation for teachers preparing class labels, which we represent as vectors blurred around the base vectors of the axes:
1. we choose an example blurring value µ, e.g., µ = 10^{−6};
2. when the range of activation values is other than the interval [0, 1], we adjust the vector ẽ_i to the new range; e.g., for tanh the range is the interval (−1, +1), and the adjustment rescales ẽ_i accordingly.
Finally, let us take a look at the code for the main loop of the program implemented on the PyTorch platform. Of course, the code snippets shown are only intended to illustrate how easy it is to add the ISBE functionality to an existing application.
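A sketch of this label preparation follows; the exact blurring scheme and the tanh-range adjustment below are illustrative assumptions, not the paper's formulas.

```python
import torch

def blurred_targets(labels, K, mu=1e-6):
    # base vector e_i blurred by mu: off-class entries mu, class entry
    # 1 - (K-1)*mu, so each row still sums to one (an assumed scheme)
    t = torch.full((labels.numel(), K), mu)
    t[torch.arange(labels.numel()), labels] = 1.0 - (K - 1) * mu
    return t

def to_tanh_range(t):
    # adjust [0, 1]-range targets to tanh's (-1, +1) range
    return 2.0 * t - 1.0

targets = blurred_targets(torch.tensor([0, 3]), K=10)
targets_tanh = to_tanh_range(targets)
```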

Experiments
What do we want to learn from the planned experiments? We already know from theory that, in the case of the SoftMax activation, we cannot worsen the parameters of the classifier relative to the one using cross-entropy, both in terms of success rate and learning time.
Therefore, we first want to verify whether theory aligns with practice, but also to check for which normalization functions the ISBE unit does not degrade the model's effectiveness compared to CE_all.
The learning time t_ISBE should be shorter than t_CE, but to be independent of the specific implementation, we compare the percentage share of backpropagation time in the total time of inference and backpropagation:

τ ≐ backpropagation time / (inference time + backpropagation time) · 100%. (3)

We evaluate the efficiency of the ISBE idea on the standard MNIST(60K) image collection and the problem of its classification.
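A hypothetical measurement of τ on a toy model (illustrative only; this is not the paper's benchmark harness):

```python
import time
import torch
import torch.nn.functional as F

model = torch.nn.Sequential(torch.nn.Linear(784, 512), torch.nn.ReLU(),
                            torch.nn.Linear(512, 10))
x = torch.randn(256, 784)
labels = torch.randint(0, 10, (256,))

t0 = time.perf_counter()
loss = F.cross_entropy(model(x), labels)  # inference (plus loss)
t1 = time.perf_counter()
loss.backward()                           # backpropagation
t2 = time.perf_counter()

tau = (t2 - t1) / (t2 - t0) * 100.0       # eq. (3): backward share in percent
```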
From many quality metrics, we choose the success rate (also called accuracy), defined as the percentage of correctly classified elements of the test collection MNIST(10K). We want to know how this value changes when we choose different architectures, different activations in the ISBE technique, but also different options for aggregating cross-entropy over the elements of the training batch. Thus, we have the following degrees of freedom in our experiments:
1. Two architecture options:
• Architecture N_0 consists of two convolutions C and two linear units F, of which the last is a projection from the space of deep feature vectors of dimension 512 to the space of raw scores for each of the K = 10 classes. Here the symbol C_32^{3x3/2} means 32 convolutions with 3x3 masks, sampled with a stride of 2, D_20 is a DropOut unit zeroing 20% of tensor elements, and F_512 is a linear unit with a matrix A ∈ R^{?×512}, where here ? = 64 is derived from the shape of the tensor produced by the previous unit.
• Architecture N_1 consists of two blocks, each with 3 convolutions; it is a purely convolutional network, except for the final projection. Note that the last convolution in each block has a padding requirement p, i.e., filling the domain of the image with additional rows and columns so that the image resolution does not change.
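A plausible PyTorch reading of the two architectures; channel counts and the pooling step beyond what the text states are assumptions, marked in the comments.

```python
import torch
import torch.nn as nn

# N0: two strided convolutions, dropout, then pooling chosen so that the
# flattened feature size is 64 (the "? = 64" in the text), F_512, and the
# projection to K = 10 raw scores.
N0 = nn.Sequential(
    nn.Conv2d(1, 32, 3, stride=2), nn.ReLU(),   # C: 32 masks 3x3, stride 2
    nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),  # assumed second conv width
    nn.Dropout(0.2),                            # D_20: zero 20% of elements
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),      # assumed pooling -> 64 features
    nn.Linear(64, 512), nn.ReLU(),              # F_512
    nn.Linear(512, 10),                         # projection to raw scores
)

# N1: two blocks of three convolutions each; the last convolution in a
# block is padded so that the resolution does not change.
def block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=2), nn.ReLU(),
        nn.Conv2d(cout, cout, 3), nn.ReLU(),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(),  # p: keep resolution
    )

N1 = nn.Sequential(block(1, 32), block(32, 64),
                   nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                   nn.Linear(64, 10))
```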
2. Three options for reducing the vector of losses in the CrossEntropyLoss unit: none, mean, sum.

3. Five options for activation functions used in the ISBE technique: SoftMax, Tanh, HardTanh, Sigmoid, and HardSigmoid.

The results of the experiments, on the one hand, confirm our assumption that this conceptual Occam's razor, i.e., the omission of the cross-entropy unit, results in time savings τ; on the other hand, the results are surprisingly positive, with an improvement in the success rate metric α in the case of the hard activation functions HardTanh and HardSigmoid. It was observed that only the reduction option none behaves exactly according to theory, i.e., the success rate is identical to the model using SoftMax normalization. The options mean and sum for the model with entropy are slightly better than the model with softmax.
Consistency of models in this case means that the number of images incorrectly classified out of 10 thousand is the same. In the experiments, it was not checked whether the same images are involved. A slight improvement here meant a difference of a few to a dozen or so errors, and model efficiency above 99.6% meant at most 40 errors per 10 thousand test images.

Comparison of time complexity
We compare time complexity according to the metric given by formula (3). Table 1 clearly shows that the total time share of backpropagation, obviously depending on the complexity of the architecture, affects the time savings of the ISBE technique compared to CrossEntropyLoss (Table 2). The absence of pluses in this table, i.e., the fact that all solutions based on ISBE are relatively faster in the learning phase, is an undeniable fact.
The greatest decrease in the share of backpropagation, over 3%, occurs for the Sigmoid and SoftMax activations. The smallest decrease in architecture N_0 is noted for the soft normalization function Tanh and its hard version HardTanh. This decrease refers to cross-entropy without reduction, where reduction means the aggregation of losses calculated for all training examples in a given batch into one numerical value.
Inspired by Theorem 1, which states that relocation of the SoftMax function preserves the softmax trick property, we also add to Table 1 data for the network N_1^r. This network differs from N_1 only in that its normalization unit has a trained relocation parameter. In practice, we accomplish training with relocation of the normalization by training the relocation of the linear unit immediately preceding it, which is done by setting its parameter bias=True.
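The difference between N_1 and N_1^r thus reduces to one flag on the preceding linear projection (a sketch):

```python
import torch.nn as nn

head_plain     = nn.Linear(512, 10, bias=False)  # N_1: no relocation
head_relocated = nn.Linear(512, 10, bias=True)   # N_1^r: trainable relocation
# softmax(A h + b) equals a relocated softmax of the raw score A h,
# with relocation c = -b, so training b trains the relocation.
```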
As we can see, the general conclusion about the advantage of the ISBE technique in terms of time reduction also holds for the model with relocation of the normalization function.

Comparison of classifier accuracy
Comparison of classifier accuracy and differences in this metric are contained in Tables 3 and 4. The accuracy is computed according to formula (4).
The number of pluses on the side of ISBE clearly exceeds the number of minuses. The justification of this phenomenon requires separate research. Some light is shed on this aspect by the analysis of learning curves: the variance in the final phase of learning is clearly lower, and the learning process is more stable. In Table 4 we observe that, with the exception of the function SoftMax, which on several images of digits performed worse than the model with cross-entropy, the soft activations have an efficiency slightly or significantly better. However, we are talking about levels of tenths or hundredths of a percent here. The largest difference, noted for the softmax option, was 15 hundredths of a percent, meaning 15 more images correctly classified. Such differences are within the margin of statistical error.
The use of relocation for the normalization function does not lead to a clear conclusion: for some models there is a slight improvement, for others a slight deterioration. It is true that the ISBE unit with sigmoid activation achieved the best efficiency of 99.69%, but this is only a matter of a few images.
Within the limits of statistical error, we can say that the ISBE technique gives the same results in recognizing MNIST classes. Its advantages are:
• a decrease of the backpropagation share in the total time,
• a simplification of the architecture, thereby playing the philosophical role of Occam's razor.

Visual analysis
Further analysis of the results will be based on the visual comparison of learning curves.
First, let us look at the loss and efficiency curves of three models, cross-entropy-mean, softmax, and sigmoid, obtained on the training data MNIST(54K) and on the data intended solely for model validation, MNIST(6K). These two loss curves are calculated after each epoch. We supplement them with a loss curve calculated progressively after each batch of training data (see Figure 1).
Note the correct course of the training loss curve with respect to the progressive loss curve: both curves are close. The course of the validation loss curve is also correct: from about epoch 30 it lies below the training curve, maintaining a significant distance. This is a sign that the model is not overfitted. This effect was achieved only after applying a moderate augmentation procedure to the input images.
Correct behavior of learning curves was recorded both for the models with entropy and for the models with the ISBE unit. This also applies to the classifier performance curves.
For the measures defined in this way, it turns out that only the reduction-by-sum option has a different range of variability, and therefore it is not shown in Figure 2.
2. In the case of classifier accuracy, a common percentage scale does not exclude placing all eight curves for each considered architecture. However, due to the low legibility of such a figure, it is also worth juxtaposing different groups of curves of the dependency α(e). The accuracy α of the classifier trained on MNIST(60K) is calculated on the test set MNIST(10K).
The sets of curves are visualized separately for architectures N_0 and N_1. For further research there remain the practical aspects of the more general Theorem 2, which implies that dilated and relocated versions of SoftMax are the only ones having the dilated softmax trick property.
Should we, therefore, celebrate this unique relationship between activation and cost function? In this work, I have shown that it is rather beneficial to use the final effect of the action of this pair, namely the linear value equal to Y − Y•, which can be calculated without their participation. This is exactly what the ISBE unit does: it calculates the soft score vector in the forward step, only to return in the backward step its error from the target score.
To determine the normalized score, the ISBE unit can use not only the SoftMax function, as it is not necessary to meet the unity condition, i.e., to ensure that the scores of the trained classifier form a probability distribution. At least four other activation functions, Sigmoid, Tanh, and their hard versions HardSigmoid and HardTanh, perform no worse. The choice of these final activations was rather a matter of chance, so researchers face further questions. How should raw scores be normalized, and how should class labels be represented (encoded) in relation to this normalization, so as not to degrade the classifier's results? What properties should such normalization functions have? Experiments suggest that meeting the Lipschitz condition in the vicinity of zero may be one of these properties.
The theoretical considerations presented prove that the ISBE unit, in the process of deep model learning, correctly simulates the behavior of the CrossEntropy unit preceded by SoftMax normalization.
The experiments showed that the ISBE unit saves up to 3% of the total time of the forward and backward stages, while the effectiveness of the classifier model remains unchanged within the margin of statistical error.
On the other hand, the code examples showed that the programmer's time spent on introducing the ISBE technique into his or her program, in place of CrossEntropyLoss, is negligible. In matrix notation ([30]), the softmax trick property has a longer proof, as we first need to calculate the Jacobian of the SoftMax function ([31]), which is ∂y/∂x = diag(y) − y y^T. The following theorem fully characterizes the functions that have the softmax trick property.
Theorem 1 (on the properties of softmax trick).
For a differentiable function F: R^K → R^K, the following three properties are equivalent:
1. F is a SoftMax function with relocation: there exists a reference point c ∈ R^K such that for every x ∈ R^K, F(x) = SoftMax(x − c);
2. F has a softmax-type Jacobian: for every x ∈ R^K, ∂F(x)/∂x = diag(F(x)) − F(x) F(x)^T;
3. F has the softmax trick property: for every x ∈ R^K and every target y• with Σ_k y•_k = 1, ∇_x CE(F(x), y•) = F(x) − y•.
Proof of Theorem 1.
We prove the implications in a cyclic order. Theorem 1 can be generalized to a dilated version of the SoftMax function. It is reformulated in the form of Theorem 2, and the proof of the generalized theorem is an easy generalization as well.
Theorem 2 (on the properties of dilated softmax trick).
For a differentiable function F: R^K → R^K, the following three properties are equivalent:
1. F is a SoftMax function with dilation and relocation: there exist a reference point c ∈ R^K and a dilation vector d ∈ R^K with d_i ≠ 0 for all i, such that for every x ∈ R^K, F(x) = SoftMax(d ⊙ x − c), where ⊙ denotes the componentwise product;
2. F has a dilated softmax-type Jacobian: there exists a dilation vector d ∈ R^K such that for every x ∈ R^K, ∂F(x)/∂x = (diag(F(x)) − F(x) F(x)^T) diag(d);
3. F has the dilated softmax trick property.
The inference and backpropagation diagrams in this paper are written in the STNN notation ([28]).

Figure 1 :
Figure 1: Learning curves on training and validation data for the N_1 network and three models: cross-entropy-mean, softmax, sigmoid. The horizontal reference line represents the accuracy on test data computed after the last epoch.
1. This is what the code looks like when loss_function is chosen as nn.CrossEntropyLoss.
2. If we prefer to have a visually shorter loop, then by introducing the variable soft_function and extending the class DataProvider so that it matches the target labels to a given soft option, we finally get a compact form.
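The two main-loop variants can be sketched as follows; the decomposition into training steps and the helper signatures are illustrative assumptions (the paper's DataProvider is not reproduced here).

```python
import torch
import torch.nn.functional as F

def train_step_ce(model, optimizer, x, labels):
    # variant 1: the standard nn.CrossEntropyLoss path
    loss_function = torch.nn.CrossEntropyLoss()
    optimizer.zero_grad()
    loss = loss_function(model(x), labels)
    loss.backward()
    optimizer.step()

def train_step_isbe(model, optimizer, x, soft_targets,
                    soft_function=torch.softmax):
    # variant 2: ISBE, i.e., normalize to a soft score and send the error
    # Y - Y* straight back into the network, skipping cross-entropy
    optimizer.zero_grad()
    scores = model(x)
    with torch.no_grad():
        y = soft_function(scores, dim=1)        # (I)(S): soft score
    scores.backward(gradient=y - soft_targets)  # (B)(E): inject the error
    optimizer.step()
```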

Table 1 :
Comparison of the metric τ, i.e., the percentage share of backpropagation time in the total time together with inference. The share τ_CE of cross-entropy with three types of reduction is compared with five functions of soft normalization. The analysis was performed for architectures N_0 and N_1.

Table 2 :
Metric ∆τ ≐ τ_ISBE − τ_CE, i.e., the decrease in the percentage share of backpropagation time in the total time together with inference. The analysis was performed for architectures N_0 and N_1.

Table 3 :
In the table, the success rate of three classifiers based on cross-entropy with different aggregation options is compared with the success rate determined for five options of soft score normalization functions. The analysis was performed for architectures N_0 and N_1.

Table 4 :
Change in success rate between models with cross-entropy and models with a soft score normalization function. The analysis was performed for architectures N_0 and N_1.