Article

Mirror Descent and Exponentiated Gradient Algorithms Using Trace-Form Entropies

1 Systems Research Institute, Polish Academy of Sciences, Newelska 6, 01-447 Warsaw, Poland
2 Department of Electrical Engineering, Warsaw University of Technology, Koszykowa 75, 00-662 Warsaw, Poland
3 Department of Electronic and Information Engineering, Tokyo University of Agriculture and Technology, Koganei-shi 184-8588, Japan
4 RIKEN Artificial Intelligence Project (AIP), 1 Chome-4-1, Nihonbashi 103-0027, Japan
5 Sony Computer Science Laboratories, Tokyo 141-0022, Japan
6 Department of Signal Processing and Communications, Universidad de Sevilla, 41092 Seville, Spain
* Authors to whom correspondence should be addressed.
Entropy 2025, 27(12), 1243; https://doi.org/10.3390/e27121243
Submission received: 24 October 2025 / Revised: 27 November 2025 / Accepted: 27 November 2025 / Published: 8 December 2025

Abstract

This paper introduces a broad class of Mirror Descent (MD) and Generalized Exponentiated Gradient (GEG) algorithms derived from trace-form entropies defined via deformed logarithms. Leveraging these generalized entropies yields MD and GEG algorithms with improved convergence behavior, robustness against vanishing and exploding gradients, and inherent adaptability to non-Euclidean geometries through mirror maps. We establish deep connections between these methods and Amari’s natural gradient, revealing a unified geometric foundation for additive, multiplicative, and natural gradient updates. Focusing on the Tsallis, Kaniadakis, Sharma–Taneja–Mittal, and Kaniadakis–Lissia–Scarfone entropy families, we show that each entropy induces a distinct Riemannian metric on the parameter space, leading to GEG algorithms that preserve the natural statistical geometry. The tunable parameters of deformed logarithms enable adaptive geometric selection, providing enhanced robustness and convergence over classical Euclidean optimization. Overall, our framework unifies key first-order MD optimization methods under a single information-geometric perspective based on generalized Bregman divergences, where the choice of entropy determines the underlying metric and dual geometric structure.

This paper is dedicated to Professor Shun-Ichi AMARI in honor of his 90th birthday.

1. Introduction

Mirror descent (MD), initially proposed by Nemirovsky and Yudin [1], has become an increasingly popular topic in optimization, artificial intelligence, and machine learning domains [2,3,4,5,6]. Its profound success stems not merely from its algorithmic efficiency, but also from deep mathematical connections to information geometry and the natural statistical structure underlying optimization problems. These connections, particularly to Amari’s Natural Gradient (NG) method, reveal that effective optimization is fundamentally about respecting the intrinsic geometry of the parameter space rather than imposing artificial Euclidean constraints [7,8,9].
The central motivation for our research emerges from a fundamental insight in information geometry: optimized learning algorithms should adapt to the Fisher information metric of the underlying statistical manifold. This principle, pioneered by Amari in the context of neural networks, establishes that the steepest descent direction on a statistical manifold is not the Euclidean gradient, but rather the natural gradient, the direction that accounts for the curvature induced by the Fisher information matrix [10,11].

1.1. The Information Geometry Perspective

The connection between Mirror Descent and Natural Gradient runs deeper than algorithmic similarity: it reflects a fundamental mathematical equivalence that has been rigorously established [2,12,13]. Our work extends this principle by showing that trace-form entropies induce natural Fisher-like metrics that can be even more appropriate for specific problem structures. Through deformed logarithms, we can systematically explore the space of possible geometries and potentially discover optimal choices of hyperparameters for given data distributions [14,15].

1.2. Challenge of Geometric Selection

While the power of geometric optimization is well established, a fundamental challenge remains: how to select the appropriate geometry for a given optimization problem? Classical approaches require domain expertise and manual tuning, limiting their applicability. Our approach addresses this through parameterized entropy families, where hyperparameters control the geometric structure [2].
The theoretical foundation of our approach rests on the connection between Bregman divergences and exponential families established in information geometry [8]. This connection indicates that choosing a generalized entropy is equivalent to selecting an appropriate exponential family structure for a specific optimization problem. See also [16] for logarithmic divergences that extend Bregman divergences. Deformed logarithms allow us to systematically explore the space of possible exponential families, enabling the discovery of optimal statistical models.
The Exponentiated Gradient (EG) and its extensions emerge as a specific and powerful instantiation of the Mirror Descent framework when the mirror map is constructed from generalized entropies and deformed logarithms. This connection is far from superficial—it represents a fundamental mathematical relationship that unifies additive and multiplicative gradient updates within a single theoretical framework [15,17,18,19,20,21].
It is important to note that Mirror Descent updates can be reparameterized as Gradient Descent in appropriately chosen coordinate systems [2,12]. This reveals that the computational complexity of Natural Gradient can be considerably reduced. Our approach adds further insight by showing that deformed logarithms naturally induce reparameterizations that preserve geometric structure, enable efficient computation, and provide implicit regularization through the choice of entropy or deformed logarithm.

1.3. Research Contributions and Scope

In this work, we systematically investigate the theoretical foundations and practical implications of employing trace-form entropies and deformed logarithms in the Mirror Descent framework.
Our primary contributions include:
  • Mathematical Framework: We establish a comprehensive mathematical foundation connecting generalized entropies, deformed logarithms, and Mirror Descent updates, providing explicit formulations for numerous well-established trace entropy families.
  • Algorithmic Innovations: We derive novel Generalized Exponentiated Gradient (GEG) algorithms with generalized multiplicative updates that leverage the flexibility of hyperparameter-controlled deformed logarithms, enabling adaptation to problem geometry.
The significance of this research extends beyond algorithmic development: it opens new avenues for understanding the geometric foundations of optimization and provides practical tools for addressing increasingly complex machine learning challenges. The unifying theoretical framework connects optimization theory, information geometry, statistical physics, and practical machine learning; it opens up new research directions and provides principled approaches to algorithm design that respect the natural geometric structure of optimization problems.

2. Preliminaries: Mirror Descent (MD) and Standard Exponentiated Gradient (EG) Updates

Notations: Vectors are denoted by boldface lowercase letters, e.g., $\mathbf{w} \in \mathbb{R}^N$, where for any vector $\mathbf{w}$ we denote its $i$-th entry by $w_i$. For any vectors $\mathbf{w}, \mathbf{v} \in \mathbb{R}^N$, we define the Hadamard product as $\mathbf{w} \odot \mathbf{v} = [w_1 v_1, \ldots, w_N v_N]^T$ and $\mathbf{w}^{\odot \alpha} = [w_1^\alpha, \ldots, w_N^\alpha]^T$. All vector operations, such as multiplications and additions, are performed componentwise. A scalar function applied to a vector acts on each entry, e.g., $f(\mathbf{w}) = [f(w_1), f(w_2), \ldots, f(w_N)]^T$. The N-dimensional real vector space with nonnegative entries is denoted by $\mathbb{R}_+^N$. We let $\mathbf{w}(t)$ denote the weight or parameter vector as a function of time $t$. The learning process advances in iterative steps: during step $t$ we start with the weight vector $\mathbf{w}(t) = \mathbf{w}_t$ and update it to a new vector $\mathbf{w}(t+1) = \mathbf{w}_{t+1}$. We define $[x]_+ = \max\{0, x\}$, and the gradient of a differentiable cost function as $\nabla_{\mathbf{w}} L(\mathbf{w}) = \partial L(\mathbf{w})/\partial \mathbf{w} = [\partial L(\mathbf{w})/\partial w_1, \ldots, \partial L(\mathbf{w})/\partial w_N]^T$. In contrast to the deformed logarithms defined later, the classical natural logarithm is denoted by $\ln(x)$.

2.1. Problem Statement

We consider the constrained optimization problem:
$$\mathbf{w}_{t+1} = \arg\min_{\mathbf{w} \in \mathbb{R}_+^N} \left\{ L(\mathbf{w}) + \frac{1}{\eta}\, D_F(\mathbf{w} \,\|\, \mathbf{w}_t) \right\},$$
where $L(\mathbf{w})$ is a continuously differentiable loss function, $\eta > 0$ is the learning rate, and $D_F(\mathbf{w}\,\|\,\mathbf{w}_t)$ is the Bregman divergence induced by a strictly convex generating function $F(\mathbf{w})$ (used here as a regularizer) [2,22]. The Bregman divergence provides the geometric foundation for Mirror Descent algorithms:
$$D_F(\mathbf{w} \,\|\, \mathbf{w}_t) = F(\mathbf{w}) - F(\mathbf{w}_t) - (\mathbf{w} - \mathbf{w}_t)^T f(\mathbf{w}_t),$$
where the generating (or potential) function $F(\mathbf{w})$ is a continuously differentiable, strictly convex function defined on a convex domain, while $f(\mathbf{w}) = \nabla_{\mathbf{w}} F(\mathbf{w})$ is the mirror map, also called the link function, which is a strictly monotonically increasing function. For fundamental properties and some extensions, see, e.g., [23,24,25,26].
The Bregman divergence measures the difference between $F(\mathbf{w})$ and its first-order Taylor approximation around $\mathbf{w}_t$, providing a natural measure of geometric proximity that respects the curvature induced by $F$; in this sense, $D_F(\mathbf{w}\,\|\,\mathbf{w}_t)$ can be viewed as a measure of curvature. This geometric structure is intimately connected to information geometry: different choices of $F$ correspond to different Riemannian metrics on the parameter manifold. The Bregman divergence includes many well-known divergences commonly used in practice, namely the squared Euclidean distance, the Kullback–Leibler divergence (relative entropy), the Itakura–Saito distance, the beta divergence, and many more [10,14,27,28,29].

2.2. Mirror Descent Update Rules and Geometric Interpretation

Setting the gradient of the objective in Equation (1) to zero yields the implicit update:
$$f(\mathbf{w}_{t+1}) = f(\mathbf{w}_t) - \eta\, \nabla_{\mathbf{w}} L(\mathbf{w}_{t+1}),$$
or equivalently
$$\mathbf{w}_{t+1} = f^{(-1)}\!\left( f(\mathbf{w}_t) - \eta\, \nabla_{\mathbf{w}} L(\mathbf{w}_{t+1}) \right),$$
where $f^{(-1)}$ is the inverse of the link function. Note that when $F$ is separable and continuous, the inverse function is defined globally (by the inverse function theorem). In general, the implicit function theorem only guarantees local inversion of multivariate functions, not the existence of a global inverse. However, when $F$ is a multivariate convex function of Legendre type, so is its convex conjugate $F^*$, and their gradients are global inverses of each other: $\nabla F = (\nabla F^*)^{-1}$ and $\nabla F^* = (\nabla F)^{-1}$.
Assuming that $\nabla_{\mathbf{w}} L(\mathbf{w}_{t+1}) \approx \nabla_{\mathbf{w}} L(\mathbf{w}_t)$, we obtain the explicit Mirror Descent update [2,5]:
$$\mathbf{w}_{t+1} = f^{(-1)}\!\left( f(\mathbf{w}_t) - \eta\, \nabla_{\mathbf{w}} L(\mathbf{w}_t) \right) = (\nabla F)^{-1}\!\left( \nabla F(\mathbf{w}_t) - \eta\, \nabla_{\mathbf{w}} L(\mathbf{w}_t) \right).$$
In MD, we map the primal point $\mathbf{w}$ to the dual space (via the link function $f(\mathbf{w}) = \nabla F(\mathbf{w})$), take a step in the direction given by the gradient of the loss function, and then map back to the primal space using the inverse of the link function. The advantage of Mirror Descent (MD) over Gradient Descent is that it takes the geometry of the problem into account through a suitable choice of link function.
Dual Space Interpretation: Mirror Descent operates by
  • Mapping to the dual space: $\boldsymbol{\Theta}_t = f(\mathbf{w}_t) = \nabla F(\mathbf{w}_t)$,
  • Taking a gradient step: $\boldsymbol{\Theta}_{t+1} = \boldsymbol{\Theta}_t - \eta\, \nabla L(\mathbf{w}_t)$,
  • Mapping back to the primal space: $\mathbf{w}_{t+1} = f^{-1}(\boldsymbol{\Theta}_{t+1})$.
This three-step process naturally incorporates problem geometry through the choice of link function f (Figure 1).
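The three-step dual-space procedure can be sketched in a few lines of NumPy. This is an illustrative sketch (the names `mirror_step`, `f`, and `f_inv` are ours, not from the paper), with the link function and its inverse supplied by the caller:

```python
import numpy as np

def mirror_step(w, grad, eta, f, f_inv):
    """One Mirror Descent step: primal -> dual, gradient step, back to primal."""
    theta = f(w)                      # map to the dual space via the link function
    theta_next = theta - eta * grad   # additive gradient step in dual coordinates
    return f_inv(theta_next)          # map back to the primal space

# With the identity link function, the step reduces to plain gradient descent.
w = np.array([1.0, 2.0])
g = np.array([0.5, -0.5])
w_next = mirror_step(w, g, 0.1, lambda x: x, lambda x: x)
```

With the identity link (i.e., $F(\mathbf{w}) = \|\mathbf{w}\|_2^2/2$), the step reduces to ordinary gradient descent; other link functions change only the two mapping calls.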
For example, consider $F(\mathbf{w}) = \sum_i w_i \log w_i$, the Shannon negative entropy. The link function is $f(\mathbf{w}) = \nabla F(\mathbf{w}) = [1 + \log w_i]_i$, with inverse map $f^{-1}(\boldsymbol{\Theta}) = \left[ e^{\Theta_i} / \sum_j e^{\Theta_j} \right]_i$. The corresponding mirror update is the exponentiated gradient update:
$$\mathbf{w}_{t+1} = \mathbf{w}_t \odot \exp\!\left( -\eta\, \nabla_{\mathbf{w}} L(\mathbf{w}_t) \right).$$
This is a standard and useful algorithm for optimization on the probability simplex that is recovered as the mirror descent with respect to the Kullback–Leibler (KL) divergence (a Bregman divergence). The underlying geometric structure is the KL Hessian geometry, an example of dually flat space in information geometry.
Note that when the generating function $F$ is separable across its coordinates (i.e., $F(\mathbf{w}) = \sum_i F(w_i)$), the Hessian matrix $\nabla^2 F(\mathbf{w})$ is diagonal.

2.3. Continuous-Time Formulation and Natural Gradient Connection

The continuous-time limit (as $\Delta t \to 0$) yields the mirror flow ODE:
$$\frac{d f(\mathbf{w}(t))}{dt} = -\mu\, \nabla_{\mathbf{w}} L(\mathbf{w}(t)),$$
where $\mu = \eta / \Delta t > 0$ is the learning rate for continuous-time learning and $f(\mathbf{w}) = \nabla F(\mathbf{w})$ is a suitably chosen link function [2]. Using the chain rule, we can write the mirror flow as follows:
$$\frac{d f(\mathbf{w})}{dt} = \frac{d f(\mathbf{w})}{d \mathbf{w}}\, \frac{d \mathbf{w}}{dt} = \operatorname{diag}\!\left( \frac{d f(w_i)}{d w_i} \right) \frac{d \mathbf{w}}{dt} = -\mu\, \nabla_{\mathbf{w}} L(\mathbf{w}(t)).$$
Hence, we obtain the continuous-time MD update in the alternative form
$$\frac{d \mathbf{w}}{dt} = -\mu\, \operatorname{diag}\!\left( \frac{d f(w_i)}{d w_i} \right)^{-1} \nabla_{\mathbf{w}} L(\mathbf{w}(t)) = -\mu\, \left[ \nabla^2 F(\mathbf{w}) \right]^{-1} \nabla_{\mathbf{w}} L(\mathbf{w}(t)).$$
This reveals that Mirror Descent in continuous time is equivalent to Natural Gradient descent with preconditioner $[\nabla^2 F(\mathbf{w})]^{-1}$, i.e., with the Riemannian metric induced by the Hessian $\nabla^2 F(\mathbf{w})$. This connection, established rigorously in [2,13], shows that geometric optimization methods are fundamentally unified.

2.4. Discrete Natural Gradient Form

The discrete version becomes
$$\mathbf{w}_{t+1} = \left[ \mathbf{w}_t - \eta\, \operatorname{diag}\!\left( \frac{d f(w_{i,t})}{d w_{i,t}} \right)^{-1} \nabla_{\mathbf{w}} L(\mathbf{w}_t) \right]_+,$$
where $\operatorname{diag}\!\left( \frac{d f(\mathbf{w})}{d \mathbf{w}} \right)^{-1} = \operatorname{diag}\!\left( \left( \frac{d f(w_1)}{d w_1} \right)^{-1}, \ldots, \left( \frac{d f(w_N)}{d w_N} \right)^{-1} \right)$. We term this update Mirror-less Mirror Descent (MMD); it represents a first-order approximation to the second-order Natural Gradient method (10) [30].
It should be noted that the above-defined diagonal matrix can be considered as the inverse of the Hessian matrix, if it exists and has positive diagonal entries for a specific set of parameters. The MMD is a special form of Natural Gradient Descent (NGD) [2,7,13].
F. Nielsen provided a geometric interpretation of NG and its connections with the Riemannian gradient, the mirror descent, and the ordinary additive gradient descent [31].

2.5. Canonical Examples and Some Geometric Insights

Case 1: For $F(\mathbf{w}) = \|\mathbf{w}\|_2^2/2 = \frac{1}{2}\sum_{i=1}^N w_i^2$ and link function $f(\mathbf{w}) = \nabla_{\mathbf{w}} F(\mathbf{w}) = \mathbf{w}$, we obtain the standard (additive) gradient descent
$$\frac{d \mathbf{w}(t)}{dt} = -\mu_t\, \nabla_{\mathbf{w}} L(\mathbf{w}(t))$$
and its discrete-time approximate version
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t\, \nabla_{\mathbf{w}} L(\mathbf{w}_t).$$
Case 2: For $F(\mathbf{w}) = \sum_{i=1}^N \left( w_i \ln(w_i) - w_i \right)$ and the corresponding (componentwise) link function $f(\mathbf{w}) = \ln(\mathbf{w})$, we obtain a (multiplicative) Exponentiated Gradient (EG) update, also called the unnormalized EG update (EGU) [20]:
$$\frac{d \ln \mathbf{w}(t)}{dt} = -\mu\, \nabla_{\mathbf{w}} L(\mathbf{w}(t)), \qquad \mathbf{w}(t) > 0 \ \ \forall t.$$
In this sense, the unnormalized exponentiated gradient update (EGU) corresponds to the discrete-time version of the continuous ODE, obtained via Euler's rule:
$$\mathbf{w}_{t+1} = \exp\!\left( \ln(\mathbf{w}_t) - \mu \Delta t\, \nabla_{\mathbf{w}} L(\mathbf{w}_t) \right) = \mathbf{w}_t \odot \exp\!\left( -\eta\, \nabla_{\mathbf{w}} L(\mathbf{w}_t) \right),$$
where $\odot$ and $\exp$ denote componentwise multiplication and componentwise exponentiation, respectively, and $\eta = \mu \Delta t > 0$ is the learning rate for discrete-time updates. This multiplicative update naturally preserves positivity constraints and corresponds to the natural geometry of the probability simplex.
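The two canonical cases differ only in the choice of link function; a minimal sketch contrasting them (function names are illustrative, not from the paper):

```python
import numpy as np

def gd_step(w, grad, eta):
    """Case 1: additive update, from F(w) = ||w||^2/2 and link f(w) = w."""
    return w - eta * grad

def egu_step(w, grad, eta):
    """Case 2: multiplicative (unnormalized EG) update, from link f(w) = ln(w)."""
    return w * np.exp(-eta * grad)

w = np.array([0.5, 1.5])
g = np.array([1.0, -1.0])
w_gd = gd_step(w, g, 0.1)
w_eg = egu_step(w, g, 0.1)
# The EGU iterate stays strictly positive for any gradient; the GD iterate need not.
```

The design difference is visible directly: the additive step can leave the positive orthant, while the multiplicative step rescales each coordinate by a positive factor.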

2.6. Motivation for Using Parameterized Deformed Logarithms

Traditional Mirror Descent methods suffer from geometric rigidity—the fixed choice of mirror map f cannot adapt to diverse problem structures or data distributions. This limitation motivates our investigation of parameterized mirror maps based on trace-form entropies.
Adaptive Geometric Framework: Our approach addresses this fundamental limitation by introducing hyperparameter-controlled mirror maps f Θ ( w ) that can:
  • Adapt to statistical properties of training distributions.
  • Interpolate between different geometries (e.g., Euclidean, exponential family, power-law).
  • Provide automatic regularization through geometric bias.
  • Enable systematic geometry exploration rather than ad-hoc selection.
Information-Theoretic Foundation: The connection between exponential families and Bregman divergences suggests that optimal mirror maps should reflect the underlying statistical structure of optimization problems. In fact, trace-form entropies provide a systematic framework for discovering these optimal geometric structures.
There are many potential choices of mirror map f ( w ) that can model the geometry of various optimization problems and adapt to the distribution of training data. In high dimensions (large-scale optimization), it can be advantageous to abandon the Euclidean geometry to improve convergence rates and performance. Using mirror descent with an appropriately chosen function we can obtain a considerable improvement.

3. Why Trace Entropies and Deformed Logarithms in MD and GEG?

Entropy measures provide natural regularization mechanisms and geometric structures for optimization algorithms. The connection between entropies, information theory, and geometry runs deep; each entropy functional induces a deformed logarithm and a unique Riemannian manifold structure through its associated Fisher information metric [32,33,34,35].
Trace entropies are functionals expressible in the explicit summation form [36,37,38,39,40,41]
$$S(\mathbf{p}) = \sum_i p_i\, f(1/p_i),$$
where p i are probability values and f ( · ) is a suitable monotonically increasing function. The term “trace” refers to their mathematical structure, which resembles the trace operation for matrices, i.e., a direct summation over individual components.
Trace entropies are intimately connected to deformed logarithms through $f(x) = \log_D(x)$, where $\log_D$ represents a deformed logarithm function with specific mathematical properties ensuring proper entropic behavior [34,38].
A function log D ( x ) qualifies as a deformed logarithm if it satisfies the following conditions:
  • Domain: $\log_D : \mathbb{R}_+ \to \mathbb{R}$;
  • Strict monotonic increase: $\frac{d \log_D(x)}{dx} > 0$;
  • Concavity (optional): $\frac{d^2 \log_D(x)}{dx^2} < 0$;
  • Scaling and normalization: $\log_D(1) = 0$ and $\left. \frac{d \log_D(x)}{dx} \right|_{x=1} = 1$;
  • Duality: $\log_D(1/x) = -\widetilde{\log}_D(x)$, where $\widetilde{\log}_D$ denotes the dual deformed logarithm.
These axioms ensure that deformed logarithms generate well-behaved entropy functionals while providing sufficient mathematical flexibility for geometric adaptation. The concavity requirement ensures that resulting entropies satisfy the maximum entropy principle, while duality guarantees symmetric treatment of probabilities and their reciprocals, which are essential for consistent statistical interpretation.
Remark 1. 
It should be noted that, since the generating (potential) function $F(\mathbf{w})$ (which is the integral of the link function $f(\mathbf{w})$) must be strictly convex, it is sufficient that the link function $f(\mathbf{w}) = \log_D(\mathbf{w})$ be strictly monotonically increasing, i.e., its first derivative must be positive; negativity of its second derivative is, in general, not necessary.
It should be noted that deformed logarithms and their corresponding deformed exponential functions can be flexibly tuned by one or more hyperparameters, whose optimization enables the adaptation to specific data distributions and problem geometries. By tuning/learning these hyperparameters, we adapt to the distribution of training data and/or we can adjust them to achieve desired properties of gradient descent algorithms.
It is of great importance to understand the mathematical structure of generalized logarithms and their inverses, the generalized exponentials, in order to gain more insight into the proposed MD and EG update schemes. Motivated by this fact, and to make this paper more self-contained, we systematically review the fundamental properties of deformed logarithms and their inverse generalized exponentials and investigate the links between them.
We provide the basics of q-algebra and κ-algebra and the associated calculus in Appendix A (see Appendix A.1) [42].

4. MD and GEG Updates Using the Tsallis Entropy and Its Extensions

4.1. Properties of the Tsallis q-Logarithm and q-Exponential

In physics, the Tsallis entropy is a generalization of the standard Boltzmann–Gibbs entropy [34,43,44]. It is proportional to the expectation of the deformed q-logarithm (referred to here as the Tsallis logarithm or q-logarithm) of a distribution.
The Tsallis q-logarithm is defined for x > 0 as [45]
$$\log_q^T(x) = \begin{cases} \dfrac{x^{1-q} - 1}{1-q} & \text{for } x > 0,\ q \neq 1, \\[4pt] \ln(x) & \text{for } x > 0,\ q = 1. \end{cases}$$
The inverse function of the Tsallis q-logarithm is the deformed q-exponential function exp q T ( x ) , defined as follows [45]:
$$\exp_q^T(x) = \begin{cases} \left[ 1 + (1-q)x \right]_+^{1/(1-q)} & \text{for } x \in \left( -\frac{1}{1-q}, \infty \right) \text{ if } q < 1, \ \ x \in \left( -\infty, \frac{1}{q-1} \right) \text{ if } q > 1, \\[4pt] \exp(x) & \text{for } q = 1. \end{cases}$$
It is easy to check that these functions satisfy the following relationships:
$$\log_q^T\!\left( \exp_q^T(x) \right) = x \quad \left( 0 < \exp_q^T(x) < \infty \right), \qquad \exp_q^T\!\left( \log_q^T(x) \right) = x \quad \text{for } x > 0.$$
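A direct NumPy transcription of the definitions above can be used to check the inverse relationships numerically; this is an illustrative sketch (the names `log_q` and `exp_q` are ours):

```python
import numpy as np

def log_q(x, q):
    """Tsallis q-logarithm: (x^(1-q) - 1)/(1-q) for q != 1, ln(x) for q = 1."""
    if q == 1.0:
        return np.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def exp_q(x, q):
    """Tsallis q-exponential: [1 + (1-q)x]_+^(1/(1-q)) for q != 1, exp(x) for q = 1."""
    if q == 1.0:
        return np.exp(x)
    return np.maximum(1.0 + (1.0 - q) * x, 0.0) ** (1.0 / (1.0 - q))

x = np.array([0.5, 1.0, 2.0])
# exp_q inverts log_q on the positive axis; both reduce to ln/exp at q = 1.
```

The pair satisfies $\exp_q^T(\log_q^T(x)) = x$ for $x > 0$, and the normalization $\log_q^T(1) = 0$ holds for every $q$.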
Remark 2. 
The q-deformed exponential and logarithmic functions were introduced in Tsallis statistical physics in 1994 [44]. However, the q-deformation is related to the Box–Cox transformation (for $q = 1 - \lambda$), which was proposed in 1964 [46].
The plots of the q-logarithm and q-exponential functions for various values of q are illustrated in Figure 2 and Figure 3.
It should be noted that the q-functions can be approximated by power series as follows:
$$\log_q^T(x) \approx \ln(x) + \frac{1-q}{2} \left( \ln(x) \right)^2 + \frac{(1-q)^2}{6} \left( \ln(x) \right)^3 + \cdots,$$
and
$$\exp_q^T(x) \approx 1 + x + \frac{q}{2}\, x^2 + \frac{2q^2 - q}{6}\, x^3 = \exp(x) + \frac{q-1}{2}\, x^2 + \frac{2q^2 - q - 1}{6}\, x^3 + O(x^4).$$
These functions have the following basic properties [44,47,48,49]:
$$\log_q^T(x) = -\log_{2-q}^T(1/x),$$
$$\frac{\partial \log_q^T(x)}{\partial x} = \frac{1}{x^q} > 0,$$
$$\frac{\partial^2 \log_q^T(x)}{\partial x^2} = -\frac{q}{x^{q+1}} < 0 \quad \text{for } q > 0.$$
It is easy to prove the following fundamental properties:
$$\log_q^T(xy) = \log_q^T(x) + \log_q^T(y) + (1-q)\log_q^T(x)\log_q^T(y) \quad \text{if } x > 0,\ y > 0,$$
$$\exp_q^T(x)\, \exp_q^T(y) = \exp_q^T\!\left( x + y + (1-q)xy \right).$$
Using these properties, we can define nonlinear generalized algebraic operations, the q-sum and the q-product (for more details about q-algebra, see [47,50]):
$$x \oplus_q^T y = x + y + (1-q)xy, \qquad (x \oplus_1^T y = x + y),$$
$$x \otimes_q y = \left[ x^{1-q} + y^{1-q} - 1 \right]_+^{1/(1-q)} \ \text{if } x > 0,\ y > 0, \qquad (x \otimes_1 y = xy).$$
Using this notation, together with the definitions of the q-exponential function (16) and the q-logarithm (15), we can write the following formulas:
$$\exp_q^T(x+y) = \exp_q^T(x) \otimes_q \exp_q^T(y), \quad \text{for } 1+(1-q)x > 0,\ 1+(1-q)y > 0,\ 1+(1-q)(x+y) > 0,$$
$$\exp_q^T\!\left( \log_q^T(x) + y \right) = x \otimes_q \exp_q^T(y), \quad \text{for } x > 0,\ 1+(1-q)y > 0,\ x^{1-q}+(1-q)y > 0,$$
which play a key role in this paper.
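The q-product identity $\exp_q^T(x+y) = \exp_q^T(x) \otimes_q \exp_q^T(y)$ is easy to verify numerically; a small sketch with scalar inputs in the admissible range (function names are ours):

```python
import numpy as np

def exp_q(x, q):
    """Tsallis q-exponential (scalar or array input)."""
    if q == 1.0:
        return np.exp(x)
    return np.maximum(1.0 + (1.0 - q) * x, 0.0) ** (1.0 / (1.0 - q))

def q_product(x, y, q):
    """q-product: [x^(1-q) + y^(1-q) - 1]_+^(1/(1-q)) for x, y > 0."""
    if q == 1.0:
        return x * y
    return np.maximum(x ** (1.0 - q) + y ** (1.0 - q) - 1.0, 0.0) ** (1.0 / (1.0 - q))

# Check exp_q(x + y) = exp_q(x) (q-product) exp_q(y) inside the admissible range.
q, x, y = 0.8, 0.3, 0.5
lhs = exp_q(x + y, q)
rhs = q_product(exp_q(x, q), exp_q(y, q), q)
```

At $q = 1$ the q-product collapses to the ordinary product, recovering $e^{x+y} = e^x e^y$.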

4.2. MD and GEG Updates Using the Tsallis q-Logarithm

Let us assume that the link function in Mirror Descent can take the following componentwise form
$$f_q(\mathbf{w}) = \log_q^T(\mathbf{w}), \qquad \mathbf{w} = [w_1, \ldots, w_N]^T \in \mathbb{R}_+^N.$$
In this case, the generating function is $F(\mathbf{w}) = \sum_i \left[ w_i \log_q(w_i) - \log_{q-1}(w_i) \right]$, and the Bregman divergence is the well-known beta divergence [28]:
$$D_{F_q}(\mathbf{w}_{t+1} \,\|\, \mathbf{w}_t) = \sum_{i=1}^N \left[ w_{i,t+1} \left( \log_q(w_{i,t+1}) - \log_q(w_{i,t}) \right) - \log_{q-1}(w_{i,t+1}) + \log_{q-1}(w_{i,t}) \right] = \sum_{i=1}^N \left[ w_{i,t+1}\, \frac{w_{i,t+1}^{1-q} - w_{i,t}^{1-q}}{1-q} - \frac{w_{i,t+1}^{2-q} - w_{i,t}^{2-q}}{2-q} \right],$$
with $\beta = 1 - q$, $\beta \neq 0$.
Applying Equation (6) and taking into account Formula (31), we obtain a novel generalized exponentiated gradient update, referred to as the q-GEG or q-MD update:
$$\mathbf{w}_{t+1} = \exp_q^T\!\left( \log_q^T(\mathbf{w}_t) - \eta\, \nabla_{\mathbf{w}} L(\mathbf{w}_t) \right) = \mathbf{w}_t \otimes_q \exp_q^T\!\left( -\eta\, \nabla_{\mathbf{w}} L(\mathbf{w}_t) \right),$$
where the q-product $\otimes_q$ defined by Equation (28) is performed componentwise.
The q-MD update can be written in scalar (componentwise) form as
$$w_{i,t+1} = w_{i,t} \otimes_q \exp_q^T\!\left( -\eta\, \nabla_{w_i} L(\mathbf{w}_t) \right) = \left[ w_{i,t}^{1-q} + \left( \exp_q^T\!\left( -\eta\, \nabla_{w_i} L(\mathbf{w}_t) \right) \right)^{1-q} - 1 \right]_+^{1/(1-q)}.$$
By applying the property (31) and substituting $y \to y/(1 + (1-q)x)$, we obtain the following identity:
$$\exp_q^T(x + y) = \exp_q^T(x)\, \exp_q^T\!\left( \frac{y}{1 + (1-q)x} \right), \qquad x, y \in \left( -\tfrac{1}{1-q}, \infty \right) \text{ if } q < 1, \quad x, y \in \left( -\infty, \tfrac{1}{q-1} \right) \text{ if } q > 1.$$
Hence, we obtain a simplified generalized q-GEG update
$$w_{i,t+1} = w_{i,t}\, \exp_q^T\!\left( -\eta\, \frac{\nabla_{w_i} L(\mathbf{w}_t)}{w_{i,t}^{1-q}} \right),$$
which can be written in the compact vector form
$$\mathbf{w}_{t+1} = \mathbf{w}_t \odot \exp_q^T\!\left( -\boldsymbol{\eta}_t \odot \nabla_{\mathbf{w}} L(\mathbf{w}_t) \right),$$
where the vector of learning rates $\boldsymbol{\eta}_t = [\eta_{1,t}, \ldots, \eta_{N,t}]^T$ has entries $\eta_{i,t} = \eta / \left( 1 + (1-q)\log_q^T(w_{i,t}) \right) = \eta\, w_{i,t}^{q-1}$.
Remark 3. 
Assuming that the learning rate is time-varying and represented by a vector, i.e., $\eta \to \boldsymbol{\eta}_t = \eta\, \mathbf{w}_t^{\odot(1-\alpha-\beta)}$, the proposed MD update takes the particular form derived and extensively tested experimentally in our recent publication [15], which, however, used a different approach employing alpha-beta divergences [14,27].
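The simplified q-GEG update, with its per-coordinate learning rates $\eta_{i,t} = \eta\, w_{i,t}^{q-1}$, can be sketched as follows (an illustrative implementation, not the authors' code):

```python
import numpy as np

def exp_q(x, q):
    """Tsallis q-exponential."""
    if q == 1.0:
        return np.exp(x)
    return np.maximum(1.0 + (1.0 - q) * x, 0.0) ** (1.0 / (1.0 - q))

def q_geg_step(w, grad, eta, q):
    """Simplified q-GEG update: w * exp_q(-eta_t * grad), eta_{i,t} = eta * w_i^(q-1)."""
    eta_t = eta * w ** (q - 1.0)        # componentwise learning rates
    return w * exp_q(-eta_t * grad, q)  # multiplicative, positivity-preserving step

w = np.array([0.2, 0.5, 1.0])
g = np.array([1.0, -0.5, 0.25])
w_next = q_geg_step(w, g, 0.1, 0.7)
# At q = 1 the step reduces to the standard EGU update w * exp(-eta * grad).
```

Note how the effective step size automatically shrinks or grows with the magnitude of each weight, which is the geometric adaptation discussed above.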

4.3. MD and EG Using Schwämmle–Tsallis (ST) Entropy

Schwämmle and Tsallis proposed the two-parameter entropy [51]
$$S_{q,q'}^{ST}(\mathbf{p}) = \sum_{i=1}^W p_i \log_{q,q'}^{ST}(1/p_i), \qquad q \neq 1,\ q' \neq 1,$$
where the deformed logarithm, referred to as the ST-logarithm or ST $(q,q')$-logarithm, is defined as
$$\log_{q,q'}^{ST}(x) = \log_{q'}^T\!\left( [x]_q \right) = \log_{q'}^T\!\left( e^{\log_q^T(x)} \right) = \frac{1}{1-q'} \left[ \exp\!\left( \frac{1-q'}{1-q} \left( x^{1-q} - 1 \right) \right) - 1 \right]$$
for $x > 0$ and $q, q' \neq 1$, where $[x]_q = \exp(\log_q^T(x))$. Its inverse is the two-parameter deformed exponential $\exp_{q,q'}^{ST}(x)$:
$$\exp_{q,q'}^{ST}(x) = \left[ 1 + \frac{1-q}{1-q'} \ln\!\left( 1 + (1-q')x \right) \right]^{1/(1-q)}.$$
Note that if either parameter $q$ or $q'$, or both, takes the value one, the above functions simplify to the Tsallis q-functions (logarithm and exponential), so we can write
$$\log_{q,1}^{ST}(x) = \log_{1,q}^{ST}(x) = \log_q^T(x), \qquad \log_{1,1}^{ST}(x) = \ln(x),$$
$$\exp_{q,1}^{ST}(x) = \exp_{1,q}^{ST}(x) = \exp_q^T(x), \qquad \exp_{1,1}^{ST}(x) = \exp(x).$$
In the special case $q = q'$, we obtain
$$\log_{q,q}^{ST}(x) = \frac{1}{1-q} \left[ \exp\!\left( x^{1-q} - 1 \right) - 1 \right] = \frac{1}{1-q} \left[ \exp\!\left( (1-q)\log_q^T(x) \right) - 1 \right], \qquad x > 0,\ q > 0,$$
and
$$\exp_{q,q}^{ST}(x) = \left[ \ln\!\left( (1-q)x + 1 \right) + 1 \right]^{1/(1-q)}.$$
The plots of the ( q , q ) -logarithm and ( q , q ) -exponential for various values of q = q are illustrated in Figure 4.
Moreover, it is easy to prove the following useful properties:
$$\log_{q,q'}^{ST}(1/x) = -\log_{2-q,2-q'}^{ST}(x),$$
$$\frac{d \log_{q,q'}^{ST}(x)}{dx} = x^{-q} \exp\!\left( \frac{1-q'}{1-q} \left( x^{1-q} - 1 \right) \right) > 0 \quad \text{for } x > 0,\ q, q' \neq 1.$$
Defining the $(q,q')$-product as [52]
$$x \otimes_{q,q'}^{ST} y = \exp_{q,q'}^{ST}\!\left( \log_{q,q'}^{ST}(x) + \log_{q,q'}^{ST}(y) \right),$$
we have the key formulas
$$\exp_{q,q'}^{ST}(x+y) = \exp_{q,q'}^{ST}(x) \otimes_{q,q'} \exp_{q,q'}^{ST}(y),$$
$$\exp_{q,q'}^{ST}\!\left( \log_{q,q'}^{ST}(x) + y \right) = x \otimes_{q,q'} \exp_{q,q'}^{ST}(y).$$
Let us now assume that the link function is defined componentwise as
$$f_{q,q'}(\mathbf{w}) = \log_{q,q'}^{ST}(\mathbf{w}).$$
The novel $(q,q')$-GEG update then takes the form
$$\mathbf{w}_{t+1} = \mathbf{w}_t \otimes_{q,q'} \exp_{q,q'}^{ST}\!\left( -\eta\, \nabla_{\mathbf{w}} L(\mathbf{w}_t) \right).$$
In this case, the update is more complex than in the previous case. An alternative approach is to apply the MMD/NG Formula (10):
$$\mathbf{w}_{t+1} = \left[ \mathbf{w}_t - \eta\, \operatorname{diag}\!\left\{ \mathbf{w}_t^{\odot q} \odot \exp\!\left( \frac{1-q'}{1-q} \left( 1 - \mathbf{w}_t^{\odot(1-q)} \right) \right) \right\} \nabla_{\mathbf{w}} L(\mathbf{w}_t) \right]_+,$$
which can be written equivalently in scalar form as
$$w_{i,t+1} = \left[ w_{i,t} - \eta\, w_{i,t}^{q}\, \exp\!\left( \frac{1-q'}{1-q} \left( 1 - w_{i,t}^{1-q} \right) \right) \frac{\partial L(\mathbf{w}_t)}{\partial w_i} \right]_+.$$
Remark 4. 
Extension to a three-parameter $(q,q',r)$-logarithm. Note that, using the definition $[x]_q = \exp(\log_q^T(x))$, we can write the ST logarithm in the compact form
$$\log_{q,q'}^{ST}(x) = \frac{[x]_q^{\,1-q'} - 1}{1-q'} = \frac{\left[ \exp\!\left( \log_q^T(x) \right) \right]^{1-q'} - 1}{1-q'}.$$
Analogously, we can define
$$[x]_{q,q'} = \exp\!\left( \log_{q,q'}^{ST}(x) \right).$$
Hence, we can formulate a three-parameter logarithm, as proposed in [53]:
$$\log_{q,q',r}^{CC}(x) = \frac{[x]_{q,q'}^{\,1-r} - 1}{1-r} = \frac{\left[ \exp\!\left( \log_{q,q'}^{ST}(x) \right) \right]^{1-r} - 1}{1-r}.$$
The plots of the ( q , q , r ) -logarithm and ( q , q , r ) -exponential for coincident values of q, q , and r are illustrated in Figure 5.
In a similar way as in the previous section, we can derive MD updates using the above-defined three-parameter logarithm as a link function.

5. MD and GEG Using the Kaniadakis Entropy and Its Extensions and Generalizations

5.1. Basic Properties of κ -Logarithm and κ -Exponential

An entropic structure emerging in the context of special relativity is the one defined by Kaniadakis [36,37] as follows:
$$S_\kappa(\mathbf{p}) = \sum_i p_i \log_\kappa^K(1/p_i),$$
where the deformed κ-logarithm, referred to as the Kaniadakis κ-logarithm, is defined as [36,37]
$$\log_\kappa^K(x) = \begin{cases} \dfrac{x^{\kappa} - x^{-\kappa}}{2\kappa} = \dfrac{1}{\kappa}\sinh(\kappa \ln(x)) & \text{if } x > 0 \text{ and } 0 < \kappa^2 < 1, \\[4pt] \ln(x) & \text{if } x > 0 \text{ and } \kappa = 0. \end{cases}$$
The inverse function of the Kaniadakis κ-logarithm is the deformed exponential function $\exp_\kappa^K(x)$, represented as
$$\exp_\kappa^K(x) = \exp\!\left( \int_0^x \frac{dy}{\sqrt{1+\kappa^2 y^2}} \right) = \left( \sqrt{1+\kappa^2 x^2} + \kappa x \right)^{1/\kappa} = \begin{cases} \exp\!\left( \dfrac{1}{\kappa}\operatorname{arsinh}(\kappa x) \right) & -1 < \kappa < 1,\ \kappa \neq 0, \\[4pt] \exp(x) & \kappa = 0. \end{cases}$$
The plots of the κ -logarithm and κ -exponential functions for various values of κ are illustrated in Figure 6 and Figure 7.
Note that the Kaniadakis logarithm can also be expressed in terms of the Tsallis logarithm as
$$\log_\kappa^K(x) = \frac{\log_{1+\kappa}^T(x) + \log_{1-\kappa}^T(x)}{2}.$$
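The κ-logarithm and κ-exponential are straightforward to implement via their sinh/arsinh representations; a sketch with illustrative names (`log_kappa`, `exp_kappa` are ours):

```python
import numpy as np

def log_kappa(x, kappa):
    """Kaniadakis kappa-logarithm: (x^k - x^(-k))/(2k) = sinh(k ln x)/k; ln(x) at k = 0."""
    if kappa == 0.0:
        return np.log(x)
    return np.sinh(kappa * np.log(x)) / kappa

def exp_kappa(x, kappa):
    """Kaniadakis kappa-exponential: exp(arsinh(k x)/k); exp(x) at k = 0."""
    if kappa == 0.0:
        return np.exp(x)
    return np.exp(np.arcsinh(kappa * x) / kappa)

x = np.array([0.5, 1.0, 2.0])
# Inverse pair; log_kappa(1/x) = -log_kappa(x) is the duality property above.
```

The duality $\log_\kappa^K(1/x) = -\log_\kappa^K(x)$ follows immediately from the oddness of $\sinh$, which the sketch makes easy to confirm numerically.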
These functions have the following fundamental and useful properties [37,41]:
$$\log_\kappa^K(1) = 0, \qquad \log_\kappa^K(0^+) = -\infty, \qquad \log_\kappa^K(\infty) = +\infty,$$
$$\log_\kappa^K(1/x) = -\log_\kappa^K(x),$$
$$\log_\kappa^K(x^\lambda) = \lambda\, \log_{\lambda\kappa}^K(x),$$
$$\log_\kappa^K(xy) = \frac{y^{\kappa} + y^{-\kappa}}{2}\, \log_\kappa^K(x) + \frac{x^{\kappa} + x^{-\kappa}}{2}\, \log_\kappa^K(y),$$
$$\log_\kappa^K(\exp(x)) = \frac{1}{\kappa}\sinh(\kappa x), \qquad \ln\!\left( \exp_\kappa^K(x) \right) = \frac{1}{\kappa}\operatorname{arsinh}(\kappa x),$$
$$\frac{\partial \log_\kappa^K(x)}{\partial x} = \frac{x^{\kappa} + x^{-\kappa}}{2x} = \frac{\cosh\left[ \kappa \ln(x) \right]}{x} > 0 \quad \text{for } x > 0,$$
$$\frac{\partial^2 \log_\kappa^K(x)}{\partial x^2} = \frac{\kappa - 1}{2}\, x^{\kappa - 2} - \frac{\kappa + 1}{2}\, x^{-\kappa - 2} < 0 \quad \text{for } \kappa \in [-1, 1].$$
The last two properties indicate that the Kaniadakis κ -logarithm is monotonically increasing for any value of κ , and for | κ | < 1 , it is additionally a strictly concave function.
The κ-logarithm can be approximated by the power series
$$\log_\kappa(x) \approx \ln(x) + \frac{\kappa^2}{3!} \left[ \ln(x) \right]^3 + \frac{\kappa^4}{5!} \left[ \ln(x) \right]^5 + \frac{\kappa^6}{7!} \left[ \ln(x) \right]^7, \qquad x > 0.$$
Furthermore, it is important to note that applying the Taylor series expansion of the κ-exponential, we can obtain the simple approximation
$$\exp_\kappa^K(x) = 1 + x + \frac{x^2}{2!} + \left( 1 - \kappa^2 \right) \frac{x^3}{3!} + \left( 1 - 4\kappa^2 \right) \frac{x^4}{4!} + \cdots = \exp(x) - \frac{\kappa^2}{3!}\, x^3 - \frac{4\kappa^2}{4!}\, x^4 + \cdots.$$
Two notable features of the κ-exponential function are that it asymptotically approaches the ordinary exponential for small $|x|$ and a power law for large $|x|$ [37,54]. Specifically,
$$\exp_\kappa(x) \underset{x \to 0}{\sim} \exp(x), \qquad \exp_\kappa(x) \underset{x \to \pm\infty}{\sim} \left| 2\kappa x \right|^{\pm 1/|\kappa|}.$$
The κ-exponential function has the following basic properties [36,37,41]:
$$\exp_\kappa^K(0) = 1, \qquad \exp_\kappa^K(-\infty) = 0^+, \qquad \exp_\kappa^K(+\infty) = +\infty,$$
$$\exp_\kappa^K(x)\, \exp_\kappa^K(-x) = 1,$$
$$\left( \exp_\kappa^K(x) \right)^r = \exp_{\kappa/r}^K(rx), \qquad r \in \mathbb{R},$$
$$\frac{\partial \exp_\kappa^K(x)}{\partial x} > 0, \qquad \frac{\partial^2 \exp_\kappa^K(x)}{\partial x^2} > 0 \quad \text{for } \kappa \in [-1, 1].$$
The last two properties mean that the Kaniadakis κ-exponential is a monotonically increasing and convex function for this range of the parameter κ.
The property (79) emerges as a particular case of the more general ones
$$\exp_\kappa^K(x)\, \exp_\kappa^K(y) = \exp_\kappa^K\!\left( x \oplus_\kappa y \right),$$
$$\log_\kappa^K(xy) = \log_\kappa^K(x) \oplus_\kappa \log_\kappa^K(y),$$
where the κ-addition is defined as
$$x \oplus_\kappa y = x\sqrt{1+\kappa^2 y^2} + y\sqrt{1+\kappa^2 x^2} \approx x + y + \frac{\kappa^2}{2}\left( xy^2 + x^2 y \right) - \frac{\kappa^4}{8}\left( xy^4 + x^4 y \right) - \cdots.$$
By defining and evaluating the κ-product
$x \otimes_\kappa y = \exp_\kappa\big(\log_\kappa(x) + \log_\kappa(y)\big)$
$= \left[\frac{x^{\kappa}-x^{-\kappa}}{2} + \frac{y^{\kappa}-y^{-\kappa}}{2} + \sqrt{1+\left(\frac{x^{\kappa}-x^{-\kappa}}{2} + \frac{y^{\kappa}-y^{-\kappa}}{2}\right)^{2}}\,\right]^{1/\kappa}$
$= \exp\!\left(\frac{1}{\kappa}\,\operatorname{arsinh}\!\left(\frac{x^{\kappa}-x^{-\kappa}+y^{\kappa}-y^{-\kappa}}{2}\right)\right)$
$= \exp\!\left(\frac{1}{\kappa}\,\operatorname{arsinh}\big(\sinh(\kappa \ln x)+\sinh(\kappa \ln y)\big)\right),$
we have the key formulas for our MD application
$\exp_\kappa^{K}(x + y) = \exp_\kappa^{K}(x) \otimes_\kappa \exp_\kappa^{K}(y),$
$\exp_\kappa^{K}\big(\log_\kappa^{K}(x) + y\big) = x \otimes_\kappa \exp_\kappa^{K}(y).$
Appendix A.2 gives an overview of the κ -algebra and calculus.
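The κ-product identities above can be verified numerically; in the following NumPy sketch (function names ours), the product is evaluated via the closed arsinh form:

```python
import numpy as np

def log_kappa(x, k):                      # sinh(k*ln x)/k
    return np.sinh(k * np.log(x)) / k

def exp_kappa(x, k):                      # inverse of log_kappa
    return np.exp(np.arcsinh(k * x) / k)

def kprod(x, y, k):
    # kappa-product: exp_kappa(log_kappa(x) + log_kappa(y)),
    # evaluated as exp((1/k)*arsinh((x^k - x^-k + y^k - y^-k)/2))
    s = (x**k - x**(-k) + y**k - y**(-k)) / 2.0
    return np.exp(np.arcsinh(s) / k)

k, x, y = 0.3, 1.7, 0.6
# closed form agrees with the definition via log/exp
assert np.isclose(kprod(x, y, k), exp_kappa(log_kappa(x, k) + log_kappa(y, k), k))
# exp_kappa(x + y) = exp_kappa(x) (x)_kappa exp_kappa(y)
assert np.isclose(exp_kappa(x + y, k), kprod(exp_kappa(x, k), exp_kappa(y, k), k))
# exp_kappa(log_kappa(x) + y) = x (x)_kappa exp_kappa(y)
assert np.isclose(exp_kappa(log_kappa(x, k) + y, k), kprod(x, exp_kappa(y, k), k))
```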

5.2. MD and GEG Using the Kaniadakis Entropy and κ -Logarithm

Let us assume that the link function in Mirror Descent can take the following componentwise form
$f_\kappa(\mathbf{w}_t) = \log_\kappa^{K}(\mathbf{w}_t), \quad \mathbf{w}_t = [w_{1,t}, \ldots, w_{N,t}]^{T} \in \mathbb{R}_{+}^{N}.$
Note that since the first derivative of $\log_\kappa^{K}(w)$ is positive and its second derivative is negative, the link function is a (componentwise) increasing function and additionally a concave function for $\kappa \in (-1,1]$. In this case, the generating function is
$F_\kappa(\mathbf{w}_t) = \sum_{i=1}^{N} \frac{1}{2\kappa}\left[\frac{w_{i,t}^{1+\kappa}}{1+\kappa} - \frac{w_{i,t}^{1-\kappa}}{1-\kappa}\right].$
Taking into account Formula (6), we obtain a novel κ-GEG update
$\mathbf{w}_{t+1} = \mathbf{w}_t \otimes_\kappa \exp_\kappa^{K}\big(-\eta\,\nabla_{\mathbf{w}} L(\mathbf{w}_t)\big),$
where $\otimes_\kappa$ is the κ-product defined by Equation (87) and is performed componentwise.
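As a concrete illustration, the κ-GEG step can be sketched in a few lines of NumPy; the quadratic toy loss and all function names are ours, chosen only to exercise the update:

```python
import numpy as np

def log_kappa(w, k):
    return np.sinh(k * np.log(w)) / k     # kappa-logarithm (link function)

def exp_kappa(x, k):
    return np.exp(np.arcsinh(k * x) / k)  # its inverse

def kappa_geg_step(w, grad, eta, k):
    # w_{t+1} = w_t (x)_kappa exp_kappa(-eta*grad), computed componentwise
    # in the dual (mirror) space as exp_kappa(log_kappa(w) - eta*grad)
    return exp_kappa(log_kappa(w, k) - eta * grad, k)

# toy problem (ours): minimize L(w) = 0.5*||w - w_star||^2 over positive weights
w_star = np.array([0.5, 1.5, 2.0])
w = np.ones(3)
for _ in range(500):
    w = kappa_geg_step(w, w - w_star, eta=0.1, k=0.3)
assert np.allclose(w, w_star, atol=1e-3)
```

Because the update is carried out in the dual space and mapped back through $\exp_\kappa$, the iterates remain strictly positive without any explicit projection.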

5.3. Two-Parameter Logarithms Based on Generalized Kaniadakis–Lissia–Scarfone Entropy

Another important generalized entropy has the following form
$S_{\kappa,r}(\mathbf{p}) = -\sum_{i=1}^{N} p_i^{\,r+1}\,\frac{p_i^{\kappa} - p_i^{-\kappa}}{2\kappa} = -\sum_{i=1}^{N} p_i \log_{\kappa,r}(p_i),$
which was introduced by Sharma, Taneja and Mittal (STM) in [55,56,57], and also investigated, independently, by Kaniadakis, Lissia and Scarfone (KLS) in [38,39,58].
Equation (93) mimics the expression of the Boltzmann–Gibbs entropy by replacing the standard natural logarithm $\ln(x)$ with the two-parameter deformed logarithm $\log_{\kappa,r}(x)$, defined as
$\log_{\kappa,r}(x) = x^{r}\,\frac{x^{\kappa}-x^{-\kappa}}{2\kappa}, \quad x > 0, \ r \in \mathbb{R}, \quad \text{for} \ -|\kappa| \le r \le \frac{1}{2} - \left|\frac{1}{2}-|\kappa|\right|.$
The surface plots of the ( κ , r ) -logarithm for various values of hyperparameters κ and r are illustrated in Figure 8 and Figure 9.
The (κ,r)-logarithm can be expressed via the Kaniadakis κ-logarithm and the Tsallis q-logarithm:
$\log_{\kappa,r}(x) = x^{r}\,\frac{x^{\kappa}-x^{-\kappa}}{2\kappa} = x^{r}\,\log_\kappa^{K}(x) = x^{\,r-\kappa}\,\log_{1-2\kappa}^{T}(x).$
Obviously, for $r = 0$ the (κ,r)-logarithm simplifies to the Kaniadakis κ-logarithm, and for $r = \pm|\kappa|$ one recovers the Tsallis q-logarithm with $q = 1 \mp 2|\kappa|$ ($0 < q < 2$).
By introducing a new parameter $\omega = r/\kappa$, or equivalently replacing $r = \omega\kappa$, we can represent the logarithm as
$\log_{\kappa,\omega}(x) = \frac{x^{\kappa(\omega+1)} - x^{\kappa(\omega-1)}}{2\kappa},$
which for $\omega = 0$ simplifies to the κ-logarithm, while for $\omega = 1$ and $\kappa = (1-q)/2$ we recover the q-logarithm. This formula indicates that the (κ,ω)-logarithm smoothly interpolates between the Kaniadakis logarithm and the Tsallis logarithm.
Summarizing, the (κ,r)-logarithm can be described as follows [38,39]:
$\log_{\kappa,r}^{KLS}(x) = \frac{x^{r+\kappa} - x^{r-\kappa}}{2\kappa}$ if $x > 0$, for $r \in \mathbb{R}$ and $-|\kappa| \le r \le |\kappa|$;
$\log_\kappa^{K}(x) = \frac{x^{\kappa} - x^{-\kappa}}{2\kappa}$ if $x > 0$, for $r = 0$, $\kappa \in [-1,1]$, $\kappa \ne 0$;
$\log_q^{T}(x) = \frac{x^{1-q} - 1}{1-q}$ if $x > 0$, for $r = \kappa = (1-q)/2$, $q \ne 1$, $q > 0$;
$\ln(x)$ if $x > 0$ and $r = \kappa = 0$.
It should also be noted that the (κ,r)-logarithm can be represented approximately by the following power series for relatively small κ and r:
$\log_{\kappa,r}^{KLS}(x) \approx \ln(x) + r\,[\ln(x)]^{2} + \frac{1}{6}\big(\kappa^{2} + 3r^{2}\big)[\ln(x)]^{3} + \cdots$
The (κ,r)-logarithm has the following basic properties:
$\log_{\kappa,r}(1) = 0, \quad \log_{\kappa,r}(0^{+}) = -\infty \ \text{for} \ r < |\kappa|, \quad \log_{\kappa,r}(+\infty) = +\infty \ \text{for} \ r > -|\kappa|,$
$\log_{\kappa,r}(1/x) = -\log_{\kappa,-r}(x),$
$\log_{\kappa,r}(x^{\lambda}) = \lambda\,\log_{\lambda\kappa,\lambda r}(x),$
$\frac{\partial \log_{\kappa,r}(x)}{\partial x} > 0 \quad \text{for} \ -|\kappa| \le r \le |\kappa|,$
$\frac{\partial^{2} \log_{\kappa,r}(x)}{\partial x^{2}} < 0 \quad \text{for} \ -|\kappa| \le r \le \frac{1}{2} - \left|\frac{1}{2}-|\kappa|\right|.$
The last two properties indicate that the (κ,r)-logarithm is a strictly monotonically increasing function for $-|\kappa| \le r \le |\kappa|$ and additionally a concave function for $-|\kappa| \le r \le \frac{1}{2} - \left|\frac{1}{2}-|\kappa|\right|$.
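The special cases of the KLS logarithm can be checked numerically; the following NumPy sketch (function names ours) verifies the reductions to the Kaniadakis and Tsallis logarithms and the factorization through the q-logarithm:

```python
import numpy as np

def log_kls(x, kappa, r):
    # two-parameter KLS logarithm: x^r * (x^kappa - x^-kappa) / (2*kappa)
    return x**r * (x**kappa - x**(-kappa)) / (2 * kappa)

def log_kappa(x, kappa):              # Kaniadakis logarithm (r = 0)
    return (x**kappa - x**(-kappa)) / (2 * kappa)

def log_q(x, q):                      # Tsallis q-logarithm
    return (x**(1 - q) - 1) / (1 - q)

x = np.linspace(0.2, 3.0, 30)
kappa = 0.3
# r = 0 recovers the Kaniadakis kappa-logarithm
assert np.allclose(log_kls(x, kappa, 0.0), log_kappa(x, kappa))
# r = kappa = (1 - q)/2 recovers the Tsallis q-logarithm
q = 1 - 2 * kappa                     # here q = 0.4
assert np.allclose(log_kls(x, kappa, kappa), log_q(x, q))
# factorization: log_kls = x^(r - kappa) * log_q with q = 1 - 2*kappa
r = 0.2
assert np.allclose(log_kls(x, kappa, r), x**(r - kappa) * log_q(x, 1 - 2 * kappa))
```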
Remark 5. 
Relation to the Euler Logarithm: It is interesting to note that, setting $a = r + \kappa$ and $b = r - \kappa$, the KLS (κ,r)-logarithm can be represented as the Euler logarithm [17,59]:
$\log_{a,b}^{Eu}(x) = \frac{x^{a} - x^{b}}{a - b}, \quad x > 0, \ a \ne b,$
which is related to the Borges–Roditi entropy [50,60].
Connection to the Schwämmle–Tsallis logarithm: by applying in (103) the nonlinear transformation $x \to \exp(\log_q(x))$ with $a = 1 - q$ and $b = 0$, we obtain the Schwämmle–Tsallis logarithm (39).
Connection to the Mean Value Theorem: The function has deep connections to the Mean Value Theorem applied to power functions. For the power function $g(t) = x^{t}$ (with fixed $x > 0$), the Mean Value Theorem guarantees the existence of some parameter $c \in (a,b)$ such that $g'(c) = \frac{g(b)-g(a)}{b-a}$, which, since $\frac{d}{dt}x^{t} = x^{t}\ln(x)$, yields:
$\log_{a,b}^{Eu}(x) = \frac{x^{b} - x^{a}}{b - a} = x^{c}\,\ln(x) \quad \text{for some} \ c \in (a,b).$
Logarithmic Mean Connection: The function relates to the logarithmic mean $L(u,v) = \frac{u-v}{\ln u - \ln v}$ through the substitution $u = x^{a}$, $v = x^{b}$. These connections provide alternative computational approaches and theoretical insights.
Exponential Function Theory: The underlying structure connects to the exponential differentiation rule $\frac{d}{dt}x^{t} = x^{t}\ln(x)$, explaining the limiting behavior observed in the analysis.
Computational and numerical considerations: The numerical analysis reveals several important computational aspects:
1. Numerical stability: the KLS logarithm becomes increasingly stable as $x \to 1$, but exhibits potential numerical instability for values of x far from unity.
2. Parameter sensitivity: small x values create higher sensitivity to parameter changes, requiring careful numerical handling.
3. Convergence properties: the limiting behavior (e.g., $\kappa \to 0$ or $a \to b$) requires special computational treatment using L'Hôpital's rule.

5.4. Exponential KLS Function and Its Properties

Although the inverse function $\exp_{\kappa,r}(x)$ of $\log_{\kappa,r}(x)$ exists by monotonicity on $\mathbb{R}$, an explicit expression, in general, cannot be given. In other words, the inverse function cannot be expressed in closed analytical form, but it can be approximated and expressed, for example, in terms of the Lambert–Tsallis $W_q$-functions, which are the solutions of the equation $W_q(z)\,[1 + (1-q)W_q(z)]_{+}^{1/(1-q)} = z$:
$\exp_{\kappa,r}(x) = \left[\frac{W_{(\lambda+1)/\lambda}\big(\lambda\,\tilde{x}^{\lambda}\big)}{\lambda}\right]^{1/(2\kappa)},$
where $\lambda = 2\kappa/(r+\kappa)$, $\tilde{x} = 2\kappa x$, and $W_q$ is the Lambert–Tsallis function [61].
Another, much simpler approach is to use Lagrange's inversion theorem around 1 to obtain the following rough power-series approximation (which may be sufficient for most of our applications):
$\exp_{\kappa,r}(x) \approx 1 + x + \frac{1}{2}(1-2r)\,x^{2} + \left(\frac{1}{6} - r + \frac{3}{2}r^{2} - \frac{1}{6}\kappa^{2}\right)x^{3} + \cdots$
$= \exp(x) - r\,x^{2} + \left(\frac{3}{2}r^{2} - r - \frac{1}{6}\kappa^{2}\right)x^{3} + O(x^{4}).$
Hence, we can represent the (κ,r)-exponential as follows:
$\exp_{\kappa,r}(x) \approx \exp(x) - r\,x^{2} + \left(\frac{3}{2}r^{2} - r - \frac{1}{6}\kappa^{2}\right)x^{3}$ for $r \in \mathbb{R}$, $-|\kappa| \le r \le |\kappa|$, $|\kappa| < 1$;
$\exp_\kappa^{K}(x) = \left(\kappa x + \sqrt{1+\kappa^{2}x^{2}}\right)^{1/\kappa}$ for $r = 0$, $\kappa \in [-1,1]$, $\kappa \ne 0$;
$\exp_q^{T}(x) = [1 + (1-q)x]_{+}^{1/(1-q)}$ for $r = \kappa = (1-q)/2$, $q \ne 1$;
$\exp(x)$ for $r = \kappa = 0$.
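In practice, since $\log_{\kappa,r}$ is strictly increasing on its admissible parameter range, its inverse can also be obtained by simple one-dimensional root finding. The sketch below (function names and bracketing interval are ours) inverts the KLS logarithm by bisection on a logarithmic scale and compares it with the series approximation above near the origin:

```python
import numpy as np

def log_kls(x, kappa, r):
    return x**r * (x**kappa - x**(-kappa)) / (2 * kappa)

def exp_kls(y, kappa, r, lo=1e-12, hi=1e12, iters=200):
    # numerical inverse of log_kls by bisection (log_kls is strictly increasing
    # for -|kappa| <= r <= |kappa|); a sketch, not an optimized implementation
    for _ in range(iters):
        mid = np.sqrt(lo * hi)            # bisect on a log scale, since x > 0
        if log_kls(mid, kappa, r) < y:
            lo = mid
        else:
            hi = mid
    return np.sqrt(lo * hi)

kappa, r = 0.3, 0.1
# round trip: exp_kls(log_kls(x)) = x
for x in (0.3, 1.0, 2.5):
    assert abs(exp_kls(log_kls(x, kappa, r), kappa, r) - x) < 1e-6
# series approximation exp(x) - r*x^2 + (1.5*r^2 - r - kappa^2/6)*x^3 near x = 0
x_small = 0.02
series = np.exp(x_small) - r * x_small**2 \
         + (1.5 * r**2 - r - kappa**2 / 6) * x_small**3
assert abs(exp_kls(x_small, kappa, r) - series) < 1e-5
```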
Furthermore, the (κ,r)-exponential function has the following fundamental properties:
$\exp_{\kappa,r}(0) = 1, \quad \exp_{\kappa,r}(-\infty) = 0^{+} \ \text{for} \ r < |\kappa|, \quad \exp_{\kappa,r}(+\infty) = +\infty \ \text{for} \ r > -|\kappa|,$
$\exp_{\kappa,r}(x)\,\exp_{\kappa,-r}(-x) = 1,$
$\big(\exp_{\kappa,r}(x)\big)^{\lambda} = \exp_{\kappa/\lambda,\,r/\lambda}(\lambda x),$
$\frac{\partial \exp_{\kappa,r}(x)}{\partial x} > 0 \quad \text{for} \ -|\kappa| \le r \le |\kappa|,$
$\frac{\partial^{2} \exp_{\kappa,r}(x)}{\partial x^{2}} > 0 \quad \text{for} \ -|\kappa| \le r \le \frac{1}{2} - \left|\frac{1}{2}-|\kappa|\right|.$
The last two properties mean that, within these parameter ranges, the (κ,r)-exponential is a monotonically increasing and convex function.
Two notable features of the (κ,r)-logarithm and the (κ,r)-exponential are that the latter asymptotically approaches the regular exponential function for small x and a power law for large |x|, while the former behaves as a power law at both ends of its domain:
$\lim_{x \to 0^{+}} \log_{\kappa,r}(x) \simeq -\frac{1}{2|\kappa|}\,x^{\,r-|\kappa|},$
$\lim_{x \to +\infty} \log_{\kappa,r}(x) \simeq \frac{1}{2|\kappa|}\,x^{\,r+|\kappa|},$
$\lim_{x \to 0} \exp_{\kappa,r}(x) \simeq \exp(x),$
$\lim_{x \to \pm\infty} \exp_{\kappa,r}(x) \simeq |2\kappa x|^{1/(r \pm |\kappa|)}.$
By defining the (κ,r)-product
$x \otimes_{\kappa,r} y = \exp_{\kappa,r}\big(\log_{\kappa,r}(x) + \log_{\kappa,r}(y)\big),$
we have the key formulas for our MD (GEG) implementations
$\exp_{\kappa,r}(x+y) = \exp_{\kappa,r}(x) \otimes_{\kappa,r} \exp_{\kappa,r}(y),$
$\exp_{\kappa,r}\big(\log_{\kappa,r}(x) + y\big) = x \otimes_{\kappa,r} \exp_{\kappa,r}(y).$
Let us assume that the link function is defined as $f(\mathbf{w}) = \log_{\kappa,r}(\mathbf{w})$ and its inverse (if an approximated version is accepted) as $f^{(-1)}(\mathbf{w}) = \exp_{\kappa,r}(\mathbf{w})$. Then, using the general MD Formula (6) and the fundamental properties described above, we obtain a general MD formula employing a wide family of deformed logarithms arising from group entropies or trace-form entropies:
$\mathbf{w}_{t+1} = \exp_{\kappa,r}\big(\log_{\kappa,r}(\mathbf{w}_t) - \eta_t \nabla L(\mathbf{w}_t)\big) = \mathbf{w}_t \otimes_{\kappa,r} \exp_{\kappa,r}\big(-\eta_t \nabla L(\mathbf{w}_t)\big),$
where the (κ,r)-multiplication is defined as
$x \otimes_{\kappa,r} y = \exp_{\kappa,r}\big(\log_{\kappa,r}(x) + \log_{\kappa,r}(y)\big).$
Alternatively, owing to the complexity of computing $\exp_{\kappa,r}(x)$ precisely in the general case, we can use the MMD/NG Formula (10) to derive a quite flexible and general NG gradient update:
$\mathbf{w}_{t+1} = \left[\mathbf{w}_t - \eta_t\,\operatorname{diag}\left\{\left(\frac{\partial \log_{\kappa,r}(w_{i,t})}{\partial w_{i,t}}\right)^{-1}\right\} \nabla L(\mathbf{w}_t)\right]_{+},$
where $\operatorname{diag}\{(\partial \log_{\kappa,r}(w_{i,t})/\partial w_{i,t})^{-1}\}$ is a positive-definite diagonal matrix, with the diagonal entries
$\left(\frac{\partial \log_{\kappa,r}(w_{i,t})}{\partial w_{i,t}}\right)^{-1} = \frac{2\kappa}{(r+\kappa)\,w_{i,t}^{\,r+\kappa-1} - (r-\kappa)\,w_{i,t}^{\,r-\kappa-1}} > 0, \quad -|\kappa| \le r \le |\kappa|.$
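This natural-gradient variant avoids the (κ,r)-exponential altogether. A minimal NumPy sketch (toy quadratic loss and function names ours) computes the diagonal entries above and applies the projected additive update:

```python
import numpy as np

def inv_metric_kls(w, kappa, r):
    # inverse of d log_{kappa,r}(w)/dw -- the diagonal NG preconditioner:
    # 2k / ((r+k)*w^(r+k-1) - (r-k)*w^(r-k-1)), positive for -|kappa| <= r <= |kappa|
    return 2 * kappa / ((r + kappa) * w**(r + kappa - 1)
                        - (r - kappa) * w**(r - kappa - 1))

def ng_step(w, grad, eta, kappa, r, eps=1e-12):
    # additive natural-gradient (MMD) update with positivity projection [.]_+
    w_new = w - eta * inv_metric_kls(w, kappa, r) * grad
    return np.maximum(w_new, eps)

# toy run (ours): minimize 0.5*||w - w_star||^2 on the positive orthant
w_star = np.array([0.4, 1.0, 2.2])
w = np.ones(3)
for _ in range(800):
    w = ng_step(w, w - w_star, eta=0.05, kappa=0.3, r=0.1)
assert np.allclose(w, w_star, atol=1e-4)
```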

6. Generalization and Normalization of Mirror Descent

Summarizing, all of the GEG updates proposed in this paper can be presented in normalized form (by projecting onto the unit simplex) in the following general and flexible form:
$\tilde{\mathbf{w}}_{t+1} = \mathbf{w}_t \otimes_{D} \exp_{D}\big(-\boldsymbol{\eta}_t \odot \nabla \hat{L}(\mathbf{w}_t)\big)$ (generalized multiplicative update),
$\mathbf{w}_{t+1} = \frac{\tilde{\mathbf{w}}_{t+1}}{\|\tilde{\mathbf{w}}_{t+1}\|_{1}}$ (projection onto the unit simplex),
where $\exp_{D}(x)$ ($\log_{D}(x)$) is a generalized exponential (logarithm), $\hat{L}(\mathbf{w}_t) = L(\mathbf{w}_t/\|\mathbf{w}_t\|_{1})$ is the normalized/scaled loss function, $\boldsymbol{\eta}_t$ is a vector of learning rates, $\nabla \hat{L}(\mathbf{w}_t) = \nabla L(\mathbf{w}_t) - \big(\mathbf{w}_t^{T}\nabla L(\mathbf{w}_t)\big)\mathbf{1}$, and the generalized D-multiplication is computed as
$\mathbf{w}_t \otimes_{D} \exp_{D}(\mathbf{g}_t) = \exp_{D}\big(\log_{D}(\mathbf{w}_t) + \mathbf{g}_t\big).$
Here, $\log_{D}(\mathbf{w}_t)$ and its inverse $\exp_{D}(\mathbf{w}_t)$ denote any of the deformed logarithms and exponentials investigated in this paper (i.e., the Tsallis, Kaniadakis, ST, KLS, and KS exponentials/logarithms).
Alternatively, when the inverse function cannot be precisely computed, we can use the MMD/NG additive natural-gradient Formula (10), which is expressed in general as
$\tilde{\mathbf{w}}_{t+1} = \left[\mathbf{w}_t - \eta\,\operatorname{diag}\left\{\left(\frac{d \log_{D}(\mathbf{w}_t)}{d \mathbf{w}_t}\right)^{-1}\right\} \nabla_{\mathbf{w}} \hat{L}(\mathbf{w}_t)\right]_{+}, \quad (127)$
$\mathbf{w}_{t+1} = \frac{\tilde{\mathbf{w}}_{t+1}}{\|\tilde{\mathbf{w}}_{t+1}\|_{1}}, \quad \mathbf{w}_t \in \mathbb{R}_{+}^{N}, \ \forall t, \quad (128)$
where $\operatorname{diag}\left\{\left(\frac{d \log_{D}(\mathbf{w})}{d \mathbf{w}}\right)^{-1}\right\} = \operatorname{diag}\left\{\left(\frac{d \log_{D}(w)}{d w_{1}}\right)^{-1}, \ldots, \left(\frac{d \log_{D}(w)}{d w_{N}}\right)^{-1}\right\}$ is a diagonal positive-definite matrix.
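The normalized multiplicative scheme above can be sketched generically, with the deformed pair passed in as functions; here it is instantiated with the Kaniadakis pair, and the toy loss, step size, and all names are ours:

```python
import numpy as np

def normalized_geg_step(w, grad_fn, eta, log_d, exp_d):
    # generalized multiplicative update followed by projection onto the simplex:
    #   w~ = exp_D(log_D(w) - eta * grad_hat),  w <- w~ / ||w~||_1
    g = grad_fn(w / np.sum(w))                      # gradient of normalized loss
    g_hat = g - np.dot(w, g) * np.ones_like(w)      # centered gradient
    w_new = exp_d(log_d(w) - eta * g_hat)
    return w_new / np.sum(w_new)

# instantiate D with the Kaniadakis pair; any deformed log/exp pair plugs in the same way
k = 0.3
log_d = lambda w: np.sinh(k * np.log(w)) / k
exp_d = lambda x: np.exp(np.arcsinh(k * x) / k)

# toy loss (ours): L(w) = 0.5*||w - p||^2 with target p on the simplex
p = np.array([0.2, 0.3, 0.5])
w = np.ones(3) / 3
for _ in range(2000):
    w = normalized_geg_step(w, lambda v: v - p, 0.2, log_d, exp_d)
assert np.allclose(w, p, atol=1e-2)
```

Since $\exp_{D}$ maps into the positive orthant, the iterates stay strictly positive, and the $\ell_1$ normalization keeps them on the unit simplex at every step.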

7. Conclusions and Discussion

This study establishes a comprehensive framework for applying trace-form entropies and associated deformed logarithms in both Mirror Descent and equivalently Generalized Exponentiated Gradient algorithms. By systematically exploring trace-form entropies, especially Tsallis, Kaniadakis, Scarfone, and Sharma–Taneja–Mittal forms as regularization terms, we unveil new families of mirror gradient descent algorithms that can be tailored to the optimization landscape through suitably chosen hyperparameters. The adoption of these generalized entropies opens the door to obtaining advantageous properties such as improved convergence rates, robustness against vanishing/exploding gradients, and inherent flexibility for handling non-Euclidean geometries. Table 1 summarizes the main results obtained from our study and lists the generalized exponentiated gradient update induced by the deformed exponential functions corresponding to the deformed logarithms used to define the various trace-form entropies.
The theoretical developments presented not only unify additive and multiplicative gradient update rules via Bregman divergences but also pave the way for designing robust machine learning algorithms that have the ability to adapt precisely to the structure of training data distributions via hyperparameters. Future work will investigate broader classes of entropic functions, extending the framework to non-convex and stochastic optimization settings, applying the proposed approach to practical problems, and performing systematic comparisons through computer simulation experiments.

Author Contributions

Conceptualization, A.C.; writing–original draft, A.C. and T.T.; editing and visualization, S.C. and F.N.; validation, F.N. and S.C.; formal analysis, A.C., T.T., F.N. and S.C.; supervision and project administration, A.C. All authors have read and agreed to the published version of the manuscript.

Funding

S. Cruces was supported in part by the MICIU/AEI/10.13039/501100011033 under Grant PID2021-123090NB-I00, in part by ERDF/EU.

Data Availability Statement

No new data were created or analyzed due to the theoretical character of this study.

Acknowledgments

We wish to express our sincere appreciation to Shun-ichi Amari for his profound influence and invaluable contributions and scientific collaborations. His expertise proved instrumental in shaping our collective understanding and continues to inspire our future endeavors.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Deformed Algebra and Calculus

Appendix A.1. q-Algebra and Calculus

In this section we briefly summarize q–Algebra [42,47]:
  • q-sum: $x \oplus_q y = x + y + (1-q)\,x y$
  • Neutral element of the q-sum: $x \oplus_q 0 = 0 \oplus_q x = x$
  • q-subtraction: $x \ominus_q y = \frac{x-y}{1+(1-q)y}, \quad y \ne -\frac{1}{1-q}$
  • q-product: $x \otimes_q y = \left[x^{1-q} + y^{1-q} - 1\right]_{+}^{1/(1-q)}$
  • Neutral element of the q-product: $x \otimes_q 1 = 1 \otimes_q x = x$
  • q-division: $x \oslash_q y = x \otimes_q (1 \oslash_q y) = \left[x^{1-q} - y^{1-q} + 1\right]_{+}^{1/(1-q)}$
  • Inverse of the q-product: $1 \oslash_q x = \left[2 - x^{1-q}\right]_{+}^{1/(1-q)}$.
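The q-algebra can be exercised directly; the NumPy sketch below (names ours) verifies the compatibility of the q-sum and q-product with the Tsallis q-exponential:

```python
import numpy as np

def q_exp(x, q):
    # Tsallis q-exponential: [1 + (1-q)x]_+^(1/(1-q))
    return np.maximum(1 + (1 - q) * x, 0.0) ** (1 / (1 - q))

def q_sum(x, y, q):        # x (+)_q y = x + y + (1-q)*x*y
    return x + y + (1 - q) * x * y

def q_prod(x, y, q):       # x (x)_q y = [x^(1-q) + y^(1-q) - 1]_+^(1/(1-q))
    return np.maximum(x**(1 - q) + y**(1 - q) - 1, 0.0) ** (1 / (1 - q))

q, x, y = 0.5, 0.8, 1.3
# exp_q(x) * exp_q(y) = exp_q(x (+)_q y)
assert np.isclose(q_exp(x, q) * q_exp(y, q), q_exp(q_sum(x, y, q), q))
# exp_q(x + y) = exp_q(x) (x)_q exp_q(y)
assert np.isclose(q_exp(x + y, q), q_prod(q_exp(x, q), q_exp(y, q), q))
# neutral elements of the q-sum and q-product
assert np.isclose(q_sum(x, 0.0, q), x) and np.isclose(q_prod(x, 1.0, q), x)
```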

Appendix A.2. κ–Algebra and Calculus

In this section we briefly summarize κ –Algebra, especially κ -product properties [36,37,38,39]:
  • κ-sum: $x \oplus_\kappa y = x\sqrt{1+\kappa^{2}y^{2}} + y\sqrt{1+\kappa^{2}x^{2}}$,
  • Neutral element of the κ-sum: $x \oplus_\kappa 0 = 0 \oplus_\kappa x = x$,
  • κ-subtraction: $x \ominus_\kappa y = x\sqrt{1+\kappa^{2}y^{2}} - y\sqrt{1+\kappa^{2}x^{2}}$,
  • κ-product: $x \otimes_\kappa y = \exp\left(\frac{1}{\kappa}\operatorname{arsinh}\left(\frac{x^{\kappa}-x^{-\kappa}+y^{\kappa}-y^{-\kappa}}{2}\right)\right)$,
  • The κ-product admits the unity as a neutral element: $x \otimes_\kappa 1 = 1 \otimes_\kappa x = x$,
  • The κ-product is commutative: $x \otimes_\kappa y = y \otimes_\kappa x$,
  • The κ-product is associative: $(x \otimes_\kappa y) \otimes_\kappa z = x \otimes_\kappa (y \otimes_\kappa z)$,
  • The inverse element of x is 1/x, i.e., $x \otimes_\kappa (1/x) = 1$,
  • κ-division: $x \oslash_\kappa y = x \otimes_\kappa (1/y) = \exp\left(\frac{1}{\kappa}\operatorname{arsinh}\left(\frac{x^{\kappa}-x^{-\kappa}-y^{\kappa}+y^{-\kappa}}{2}\right)\right)$,
  • Inverse of the κ-product: $1 \oslash_\kappa x = x^{-1}$.

References

  1. Nemirovsky, A.; Yudin, D.B. Problem Complexity and Method Efficiency in Optimization; John Wiley and Sons: Hoboken, NJ, USA, 1983. [Google Scholar] [CrossRef]
  2. Amid, E.; Warmuth, M.K. Reparameterizing Mirror Descent as Gradient Descent. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS’20), Vancouver, BC, Canada, 6–12 December 2020; Curran Associates Inc.: Red Hook, NY, USA, 2020; pp. 8430–8439. [Google Scholar]
  3. Amid, E.; Warmuth, M.K. Winnowing with Gradient Descent. In Proceedings of the 33rd International Conference on Algorithmic Learning Theory, PMLR 125, Graz, Austria, 9–12 July 2020; pp. 163–182. [Google Scholar]
  4. Ghai, U.; Hazan, E.; Singer, Y. Exponentiated Gradient Meets Gradient Descent. In Proceedings of the 31st International Conference on Algorithmic Learning Theory, PMLR 117, San Diego, CA, USA, 8–11 February 2020; pp. 386–407. [Google Scholar] [CrossRef]
  5. Shalev-Shwartz, S. Online learning and online convex optimization. Found. Trends Mach. Learn. 2011, 4, 107–194. [Google Scholar] [CrossRef]
  6. Beck, A.; Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 2003, 31, 167–175. [Google Scholar] [CrossRef]
  7. Amari, S. Natural gradient works efficiently in learning. Neural Comput. 1998, 10, 251–276. [Google Scholar] [CrossRef]
  8. Amari, S. Information Geometry and Its Applications; Springer: Berlin/Heidelberg, Germany, 2016; Volume 194. [Google Scholar]
  9. Amari, S. Information Geometry and Its Applications: Convex Function and Dually Flat Manifold. In Emerging Trends in Visual Computing; Nielsen, F., Ed.; Springer Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2009; pp. 75–102. [Google Scholar]
  10. Amari, S. Alpha-divergence is unique, belonging to both f-divergence and Bregman divergence classes. IEEE Trans. Inf. Theory 2009, 55, 4925–4931. [Google Scholar] [CrossRef]
  11. Amari, S.; Cichocki, A. Information geometry of divergence functions. Bull. Pol. Acad. Sci. 2010, 58, 183–195. [Google Scholar] [CrossRef]
  12. Amid, E.; Nielsen, F.; Nock, R.; Warmuth, M.K. Optimal transport with tempered exponential measures. Proc. AAAI Conf. Artif. Intell. 2024, 38, 10838–10846. [Google Scholar] [CrossRef]
  13. Raskutti, G.; Mukherjee, S. The information geometry of mirror descent. IEEE Trans. Inf. Theory 2015, 61, 1451–1457. [Google Scholar] [CrossRef]
  14. Cichocki, A.; Cruces, S.; Amari, S.I. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy 2011, 13, 134–170. [Google Scholar] [CrossRef]
  15. Cichocki, A.; Cruces, S.; Sarmiento, A.; Tanaka, T. Generalized Exponentiated Gradient Algorithms and Their Application to On-Line Portfolio Selection. IEEE Access 2024, 12, 197000–197020. [Google Scholar] [CrossRef]
  16. Kainth, A.S.; Wong, T.-K.L.; Rudzicz, F. Conformal mirror descent with logarithmic divergences. Inf. Geom. 2024, 7 (Suppl. 1), 303–327. [Google Scholar] [CrossRef]
  17. Cichocki, A. Generalized Exponentiated Gradient Algorithms Using the Euler Two-Parameter Logarithm. arXiv 2025, arXiv:2502.17500. [Google Scholar] [CrossRef]
  18. Cichocki, A. Mirror Descent Using the Tempesta Generalized Multi-parametric Logarithms. arXiv 2025, arXiv:2506.13984. [Google Scholar]
  19. Helmbold, D.P.; Schapire, R.E.; Singer, Y.; Warmuth, M.K. On-line Portfolio Selection Using Multiplicative Updates. Math. Financ. 1998, 8, 325–347. [Google Scholar] [CrossRef]
  20. Kivinen, J.; Warmuth, M.K. Exponentiated Gradient versus Gradient Descent for Linear Predictors. Inf. Comput. 1997, 132, 1–63. [Google Scholar] [CrossRef]
  21. Kivinen, J.; Warmuth, M.K. Additive Versus Exponentiated Gradient Updates for Linear Prediction. In Proceedings of the Twenty-Seventh Annual ACM Symposium on Theory of Computing, Las Vegas, NV, USA, 29 May–1 June 1995; pp. 209–218. [Google Scholar] [CrossRef]
  22. Bregman, L. The relaxation method of finding a common point of convex sets and its application to the solution of problems in convex programming. Comp. Math. Phys. USSR 1967, 7, 200–217. [Google Scholar] [CrossRef]
  23. Burachik, R.S.; Dao, M.N.; Lindstrom, S.B. The generalized Bregman distance. SIAM J. Optim. 2021, 31, 404–424. [Google Scholar] [CrossRef]
  24. Martinez-Legaz, J.E.; Tamadoni Jahromi, M.; Naraghirad, E. On Bregman-type distances and their associated projection mappings. J. Optim. Theory Appl. 2022, 193, 107–117. [Google Scholar] [CrossRef]
  25. Nielsen, F.; Nock, R. Generalizing skew Jensen divergences and Bregman divergences with comparative convexity. IEEE Signal Process. Lett. 2017, 24, 1123–1127. [Google Scholar] [CrossRef]
  26. Nock, R.; Nielsen, F. Bregman divergences and surrogates for learning. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 31, 2048–2059. [Google Scholar] [CrossRef]
  27. Cichocki, A.; Amari, S.I. Families of α-β-and γ-divergences: Flexible and robust measures of similarities. Entropy 2010, 12, 1532–1568. [Google Scholar] [CrossRef]
  28. Cichocki, A.; Zdunek, R.; Phan, A.H.; Amari, S.I. Multiplicative Iterative Algorithms for NMF with Sparsity Constraints. In Nonnegative Matrix and Tensor Factorizations; Chapter 3; John Wiley and Sons: Hoboken, NJ, USA, 2009; pp. 131–202. [Google Scholar] [CrossRef]
  29. Cichocki, A.; Zdunek, R.; Amari, S. Csiszár’s Divergences for Nonnegative Matrix Factorization: Family of New Algorithms. In Independent Component Analysis and Signal Separation; Springer: Berlin/Heidelberg, Germany, 2006; Volume 3889, pp. 32–39. [Google Scholar]
  30. Gunasekar, S.; Woodworth, B.; Srebro, N. Mirrorless Mirror Descent: A Natural Derivation of Mirror Descent. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021, San Diego, CA, USA, 13–15 April 2021; Volume 130, pp. 2305–2313. [Google Scholar]
  31. Nielsen, F. A Note on the Natural Gradient and Its Connections with the Riemannian Gradient, the Mirror Descent, and the Ordinary Gradient. Github Report. Available online: https://franknielsen.github.io/blog/NaturalGradientConnections/NaturalGradientConnections.pdf (accessed on 1 August 2020).
  32. Hristopulos, D.T.; da Silva, S.L.E.; Scarfone, A.M. Twenty Years of Kaniadakis Entropy: Current Trends and Future Perspectives. Entropy 2025, 27, 247. [Google Scholar] [CrossRef]
  33. Naudts, J. Deformed exponentials and logarithms in generalized thermostatistics. Phys. A Stat. Mech. Its Appl. 2002, 316, 323–334. [Google Scholar] [CrossRef]
  34. Tsallis, C. Entropy. Encyclopedia 2022, 2, 264–300. [Google Scholar] [CrossRef]
  35. Wada, T.; Scarfone, A.M. Finite difference and averaging operators in generalized entropies. J. Phys. Conf. Ser. 2010, 201, 012005. [Google Scholar] [CrossRef]
  36. Kaniadakis, G.; Scarfone, A.M. A new one-parameter deformation of the exponential function. Phys. A Stat. Mech. Its Appl. 2002, 305, 69–75. [Google Scholar] [CrossRef]
  37. Kaniadakis, G. Statistical mechanics in the context of special relativity. Phys. Rev. E 2002, 66, 056125. [Google Scholar] [CrossRef]
  38. Kaniadakis, G.; Lissia, M.; Scarfone, A.M. Deformed logarithms and entropies. Phys. A Stat. Mech. Its Appl. 2004, 340, 41–49. [Google Scholar] [CrossRef]
  39. Kaniadakis, G.; Lissia, M.; Scarfone, A.M. Two-parameter deformations of logarithm, exponential, and entropy: A consistent framework for generalized statistical mechanics. Phys. Rev. E 2005, 71, 046128. [Google Scholar] [CrossRef]
  40. Tempesta, P. A theorem on the existence of trace-form generalized entropies. Proc. R. Soc. A Math. Phys. Eng. Sci. 2015, 471, 20150165. [Google Scholar] [CrossRef]
  41. Wada, T.; Scarfone, A.M. On the Kaniadakis distributions applied in statistical physics and natural sciences. Entropy 2023, 25, 292. [Google Scholar] [CrossRef]
  42. Gomez, I.S.; Borges, E.P. Algebraic structures and position-dependent mass Schrödinger equation from group entropy theory. Lett. Math. Phys. 2021, 111, 43. [Google Scholar] [CrossRef]
  43. Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487. [Google Scholar] [CrossRef]
  44. Tsallis, C. What are the numbers that experiments provide. Quim. Nova 1994, 17, 468–471. [Google Scholar]
  45. Ishige, K.; Salani, P.; Takatsu, A. Hierarchy of deformations in concavity. Inf. Geom. 2024, 7 (Suppl. 1), 251–269. [Google Scholar] [CrossRef]
  46. Box, G.E.P.; Cox, D.R. An Analysis of Transformations. J. R. Stat. Soc. Ser. B 1964, 26, 211–252. [Google Scholar] [CrossRef]
  47. Borges, E.P. A possible deformed algebra and calculus inspired in nonextensive thermostatistics. Phys. A Stat. Mech. Its Appl. 2004, 340, 95–101. [Google Scholar] [CrossRef]
  48. Yamano, T. Some properties of q-logarithm and q-exponential functions in Tsallis statistics. Phys. A Stat. Mech. Its Appl. 2002, 305, 486–496. [Google Scholar] [CrossRef]
  49. Nock, R.; Amid, E.; Warmuth, M.K. Boosting with Tempered Exponential Measures. arXiv 2023, arXiv:2306.05487. [Google Scholar] [CrossRef]
  50. Borges, E.P.; Roditi, I. A family of nonextensive entropies. Phys. Lett. A 1998, 246, 399–402. [Google Scholar] [CrossRef]
  51. Schwämmle, V.; Tsallis, C. Two-parameter generalization of the logarithm and exponential functions and Boltzmann-Gibbs-Shannon entropy. J. Math. Phys. 2007, 48, 113301. [Google Scholar] [CrossRef]
  52. Cardoso, P.G.; Borges, E.P.; Lobao, T.C.; Pinho, S.T. Nondistributive algebraic structures derived from nonextensive statistical mechanics. J. Math. Phys. 2008, 49, 093509. [Google Scholar] [CrossRef]
  53. Corcino, C.B.; Corcino, R.B. Three-Parameter Logarithm and Entropy. J. Funct. Spaces 2020, 2020, 9791789. [Google Scholar] [CrossRef]
  54. Kaniadakis, G. Maximum entropy principle and power-law tailed distributions. Eur. Phys. J. B 2009, 70, 3–13. [Google Scholar] [CrossRef]
  55. Mittal, D.P. On some functional equations concerning entropy, directed divergence and inaccuracy. Metrika 1975, 22, 35–45. [Google Scholar] [CrossRef]
  56. Sharma, B.D.; Taneja, I.J. Entropy of type (α, β) and other generalized measures in information theory. Metrika 1975, 22, 205–215. [Google Scholar] [CrossRef]
  57. Taneja, I.J. On generalized information measures and their applications. Adv. Electron. Electron Phys. 1989, 76, 327–413. [Google Scholar]
  58. Scarfone, A.M.; Suyari, H.; Wada, T. Gauss law of error revisited in the framework of Sharma-Taneja-Mittal information measure. Cent. Eur. J. Phys. 2009, 7, 414–420. [Google Scholar] [CrossRef]
  59. Kaniadakis, G.; Scarfone, A.M.; Sparavigna, A.; Wada, T. Composition law of κ-entropy for statistically independent systems. Phys. Rev. E 2017, 95, 052112. [Google Scholar] [CrossRef]
  60. Furuichi, S. An axiomatic characterization of a two-parameter extended relative entropy. J. Math. Phys. 2010, 51, 123302. [Google Scholar] [CrossRef]
  61. Da Silva, G.B.; Ramos, R.V. The Lambert-Tsallis Wq function. Phys. A Stat. Mech. Its Appl. 2019, 525, 164–170. [Google Scholar] [CrossRef]
Figure 1. Mirror descent is a three-step process: 1. Map the parameters with the link function to the dual space (mirror space), 2. Perform gradient descent in the dual space, and 3. Map back to the primal parameter space using the inverse of the link function.
Figure 2. Plots of the q-logarithm $\log_q^{T}(x)$ and q-exponential $\exp_q^{T}(x)$ functions for different values of the parameter q. From the figure, one can observe how the q parameter controls the degree of concavity/convexity of the q-logarithm as well as the degree of convexity/concavity of the q-exponential: the q-logarithm is convex for q < 0, linear for q = 0, and strictly concave for q > 0, particularizing to the classical logarithm for q = 1.
Figure 3. A 3D plot of the q-logarithm log q T ( x ) . The black continuous line represents the reference of the classical logarithm ln ( x ) , which is obtained for q = 1 .
Figure 4. Plots of the (q, q′)-logarithm and (q, q′)-exponential functions for different values of the parameters in the special case q′ = q.
Figure 5. Plots of the (q, q′, r)-logarithm and (q, q′, r)-exponential functions when the parameters are coincident, q = q′ = r.
Figure 6. Plots of the κ -logarithm and κ -exponential functions for different values of the parameter κ .
Figure 7. Surface plots of the κ -logarithm. The black continuous line represents the reference of the classical logarithm ln ( x ) , which is obtained for κ = 0 .
Figure 8. Surface plots of the ( κ , r ) -logarithm for various values of hyperparameters κ and r. The left-hand-side figure illustrates the ( κ , r ) -logarithm in terms of κ and x when r = 0.7 . The right-hand-side figure illustrates the ( κ , r ) -logarithm, now in terms of r and x, when κ = 0.7 . The black dashed line coincides with the κ -logarithm for κ = 0.7 , since in this case r = 0 .
Figure 9. Surface plots of the ( κ , r ) -logarithm in terms of the hyperparameters κ and r, when x { 0.3 , 0.7 } . From the drawings, it is apparent how the changes of the hyperparameter r have a much stronger influence in the magnification of the response, in comparison with the changes in the hyperparameter κ that correspond with more subtle elongations. The figure on the left-hand-side evaluates the ( κ , r ) -logarithm for x = 0.3 , whereas the figure on the right-hand-side evaluates it for x = 0.7 .
Table 1. Overview of the generalized exponentiated gradient (GEG) updates.
Entropy | Deformed exponential | MD/GEG update
Shannon | $\exp(x) = \sum_{i=0}^{\infty} \frac{x^{i}}{i!}$ | $\mathbf{w}_{t+1} = \mathbf{w}_t \odot \exp(-\eta \nabla L(\mathbf{w}_t))$ (EG)
Tsallis | $\exp_q^{T}(x) = [1+(1-q)x]_{+}^{1/(1-q)}$ for $q \ne 1$; $\exp(x)$ for $q = 1$ | $\mathbf{w}_{t+1} = \exp_q^{T}\big(\log_q^{T}(\mathbf{w}_t) - \eta\,\nabla_{\mathbf{w}} L(\mathbf{w}_t)\big) = \mathbf{w}_t \otimes_q \exp_q^{T}\big(-\eta\,\nabla_{\mathbf{w}} L(\mathbf{w}_t)\big)$ (q-GEG)
Schwämmle–Tsallis | $\exp_{q,q'}^{ST}(x) = \left[1 + \frac{1-q}{1-q'}\ln\big(1+(1-q')x\big)\right]^{1/(1-q)}$ | $\mathbf{w}_{t+1} = \left[\mathbf{w}_t - \eta\,\operatorname{diag}\left\{w_{i,t}^{\,q}\exp\left(\frac{1-q'}{1-q}\big(1-w_{i,t}^{\,1-q}\big)\right)\right\}\nabla_{\mathbf{w}} L(\mathbf{w}_t)\right]_{+}$
Kaniadakis | $\exp_\kappa^{K}(x) = \exp\big(\frac{1}{\kappa}\operatorname{arsinh}(\kappa x)\big)$ for $-1 < \kappa < 1$, $\kappa \ne 0$; $\exp(x)$ for $\kappa = 0$ | $\mathbf{w}_{t+1} = \mathbf{w}_t \otimes_\kappa \exp_\kappa^{K}\big(-\eta\,\nabla_{\mathbf{w}} L(\mathbf{w}_t)\big)$ (κ-GEG)
KLS | $\exp_{\kappa,r}(x)$ | $\mathbf{w}_{t+1} = \left[\mathbf{w}_t - \eta_t\,\operatorname{diag}\left\{\left(\frac{\partial \log_{\kappa,r}(w_{i,t})}{\partial w_{i,t}}\right)^{-1}\right\}\nabla L(\mathbf{w}_t)\right]_{+}$
Generic | $\exp_{D}(x)$ | $\tilde{\mathbf{w}}_{t+1} = \left[\mathbf{w}_t - \eta\,\operatorname{diag}\left\{\left(\frac{d \log_{D}(w_{i,t})}{d w_{i,t}}\right)^{-1}\right\}\nabla_{\mathbf{w}} \hat{L}(\mathbf{w}_t)\right]_{+}$, $\mathbf{w}_{t+1} = \tilde{\mathbf{w}}_{t+1}/\|\tilde{\mathbf{w}}_{t+1}\|_{1}$
