Topological Regularization for Representation Learning via Persistent Homology

Abstract: Generalization is challenging in small-sample-size regimes with over-parameterized deep neural networks, and a better representation is generally beneficial for generalization. In this paper, we present a novel method for controlling the internal representations of deep neural networks from a topological perspective. Leveraging the power of topological data analysis (TDA), we study the push-forward probability measure induced by the feature extractor, and we formulate, for the first time, a notion of "separation" that characterizes a property of this measure in terms of persistent homology. Moreover, we analyze this property theoretically and prove that enforcing it leads to better generalization. To impose this property, we propose a novel weight function for extracting topological information and introduce a new regularizer, consisting of three terms, to guide representation learning in a topology-aware manner. Experimental results on a point cloud optimization task show that our method is effective and powerful. Furthermore, results on the image classification task show that our method outperforms previous methods by a significant margin.


Introduction
Although over-parameterized deep neural networks generalize well in practice when sufficient data are provided, in small-sample-size regimes, generalization is more difficult and requires careful consideration. Since the ability to learn task-specific representations is beneficial for generalization, much effort has been devoted to imposing structure on the latent or representation space via additional regularizers [1,2], in order to guide the mapping from the input space into the internal space or to control properties of the internal representations [3]. However, internal representations are high-dimensional, discrete, sparse, incomplete and noisy, and extracting information from such data is rather challenging.
There are various ways to explore and control internal representations: (1) For algebraic methods based on vector spaces [4], coordinates are not natural, the power of linear transformations is limited, and low-dimensional visualizations cannot faithfully characterize the data. (2) For statistical methods [5], a small sample size limits the power of analysis, inference and computation; asymptotic statistics cannot be used; and the results may exhibit large variance. (3) For geometric methods based on distances [6] or manifold assumptions [7], it is difficult to capture the global picture, and the metrics are not theoretically justified [8] (for neural networks, notions of distance are constructed by the feature extractor, which is hard to understand). Individual parameter choices may significantly influence the results, and constraints based on geometric information, such as pairwise distances, may be too strict to respect reality. (4) Methods based on calculus [9,10] can capture local information in a small neighborhood of each point, but their performance is questionable when high-dimensional data are sparse and the sample size is small.
Beyond these traditional methods, there is a fundamentally different and, in this setting, unexplored perspective: topological data analysis (TDA). The advantages of TDA methods [8,11–13] are as follows: (1) TDA studies the global "shape" of data and explores the underlying topological and geometric structures of point clouds. As a complement to localized and generally more rigid geometric features, topological features are suitable for capturing multi-scale, global and intrinsic properties of data. (2) Topological methods study geometric features in a way that is less sensitive to the choice of metric; this insensitivity is beneficial when the metric is not well understood or only determined in a coarse way [8], as in neural networks. (3) Topological methods are coordinate-free and focus only on the intrinsic properties of geometric objects. (4) Instead of fixing a single spatial scale at which to understand and control the data, persistent homology collects information over the whole domain of parameter values and creates a summary in which features that persist over a wide range of spatial scales are considered more likely to represent true features of the underlying space rather than artifacts of sampling, noise or a particular choice of parameters.

Related Works
Previous work related to ours can be divided into two categories. The first category focuses on regularization using statistical information of internal representations. Cogswell et al. [1] proposed a regularizer that encourages diverse, non-redundant representations by minimizing the cross-covariance of internal representations. Choi et al. [2] designed two class-wise regularizers to enforce desired characteristics for each class: one reduces the covariance of the representations of samples from the same class, and the other uses variance instead of covariance to improve compactness. The second category studies deep neural networks using tools from algebraic topology, in particular persistent homology. Brüel-Gabrielsson et al. [14] presented a differentiable topology layer to extract topological features, which can be used to promote topological structure or incorporate a topological prior via regularization. Kim et al. [15] proposed a topological layer for deep generative models that feeds critical topological information into subsequent layers and provided an adaptation for the distance-to-measure (DTM) function-based filtration. Hajij et al. [16] defined and studied the classification problem of machine learning in a topological setting and showed when the classification problem is or is not possible in the context of neural networks. Li et al. [17] proposed an active learning algorithm that characterizes decision boundaries using their homology. Chen et al. [18] proposed measuring the complexity of the classification boundary via persistent homology and used this topological complexity to control the decision boundary via regularization. Vandaele et al. [19] introduced a novel set of topological losses to topologically regularize data embeddings in unsupervised feature learning, which can efficiently incorporate a topological prior. Hofer et al. [20] considered the problem of representation learning, treated each mini-batch as a point cloud, and controlled the connectivity of the latent space via a novel topological loss. Moor et al. [3] extended this work and proposed a loss term that harmonizes the topological features of the input space with those of the latent space. This approach also acts on the level of mini-batches: it computes persistence diagrams (PDs) for both the input space and the latent space and encourages these two diagrams to be similar via a regularization term. Wu et al. [21] explored the rich spatial behavior of data in the latent space, proposed a topological filter to filter out noisy labels, and theoretically proved that the method is guaranteed to collect clean data with high probability. These works show, empirically or theoretically, that enforcing a certain topological structure on the representation space can be beneficial for learning tasks.
Hofer et al. [22] proposed an approach that regularizes the internal representation to control the topological properties of the internal space, and proved that this approach enforces mass concentration effects that are beneficial for generalization. However, their work rests on the assumption that a loss function yielding a large margin in the representation space is used, so that the mass concentration effect is only beneficial if the reference set is located sufficiently far away from the decision boundary; this large-margin assumption may be violated in practice.

Contribution
In this paper, we apply the TDA method, in particular, persistent homology from algebraic topology, to analyze and control the global topology of the internal representations of the training points, which reveals the intrinsic structure of the representation space.
When TDA is combined with statistics, data are deemed to be generated from some unknown distribution rather than sampled from some underlying manifold, and TDA methods are used to infer topological features of that distribution, especially of its support. Inspired by [22], by combining statistics with topological data analysis, our work focuses on the probability measure induced by the feature extractor: we treat the representations of the training points in a mini-batch as point cloud data, extract topological information from them, and compute the persistence diagram of the persistent homology. Specifically, we consider the topological properties of samples from the product measure of two classes in order to enforce intra-class mass concentration and inter-class separation simultaneously. We extend the definitions and techniques in [22] to formalize the separation between two classes via persistent homology. We argue that if this separation property is encouraged, then both mass concentration and separation are enforced, and we propose a novel weight function and construct a novel loss to control the topological properties of the representation space.
In summary, our contributions are as follows: (1) We characterize a separation property between two classes in representation space in terms of persistent homology (Section 3.2).
(2) We prove that a topological constraint on the samples of the push-forward probability measure in the representation space leads to mass separation (Section 3.2).
(3) We propose a novel weight function based on the DTM. Using our weight function, a weighted Rips filtration can be built on top of the training samples from class pairs in a mini-batch. The stability of the persistence diagram with respect to the proposed weight function is established (Section 3.4).
(4) We propose three regularization terms, a birth loss, a margin loss and a length loss, which operate on a persistence diagram obtained via persistent homology computations on mini-batches, to encourage mass separation (Section 3.4).
The remainder of this paper is structured as follows: Section 2 presents the topological preliminaries relevant to our work. Section 3 gives our main results, including the separation property, the weight function and the regularization method. Section 4 shows experimental results on synthetic data and benchmark datasets. Finally, Section 5 concludes.

Topological Preliminaries
Generally, in topological data analysis, point clouds are regarded as finite samples taken from an underlying geometric object. To extract topological and geometric information, a natural way is to "connect" data points that are close to each other, building a global continuous shape on top of the data. This section contains a brief introduction to the relevant topological notions. More details can be found in several excellent introductions and surveys [8,11–13,23].

Simplicial Complex, Persistent Homology and Persistence Diagrams
A simplicial complex K is a discrete structure built over a finite set of samples to provide a topological approximation of the underlying topology or geometry. The Čech complex and the Vietoris-Rips complex are widely used in TDA. Below, for any x ∈ X and r > 0, let B(x; r) be the open ball of radius r centered at x.

Definition 1 (Čech complex [11]). Let X ⊂ X be finite and r > 0. The Čech complex C(X; r) is the simplicial complex defined by {x_0, . . . , x_k} ∈ C(X; r) :⇔ ∩_{i=0}^{k} B(x_i; r) ≠ ∅.

Definition 2 (Vietoris-Rips complex [11]). Let X ⊂ X be finite and r > 0. The Vietoris-Rips complex R(X; r) is the simplicial complex defined by {x_0, . . . , x_k} ∈ R(X; r) :⇔ B(x_i; r) ∩ B(x_j; r) ≠ ∅ for any i, j ∈ {0, . . . , k} ⇔ d(x_i, x_j) ≤ 2r for any i, j ∈ {0, . . . , k}.

For a simplicial complex K, the k-th homology group of K, denoted H_k(K), characterizes the k-dimensional topological features of K. The k-th Betti number of K is the dimension β_k(K) = dim H_k(K) of the vector space H_k(K); it counts the number of k-dimensional features of K. For example, dim H_0(K) counts the number of connected components, dim H_1(K) counts the number of holes, and so on.
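The pairwise condition in Definition 2 makes the Vietoris-Rips complex easy to build directly from pairwise distances. The brute-force sketch below is purely illustrative (real TDA libraries such as GUDHI or Ripser are far more efficient); the function name and `max_dim` parameter are our own.

```python
from itertools import combinations
import math

def rips_complex(points, r, max_dim=2):
    """Vietoris-Rips complex R(X; r): a simplex is included exactly when
    all pairwise distances among its vertices are at most 2r."""
    n = len(points)
    simplices = [(i,) for i in range(n)]  # vertices are always present
    for k in range(2, max_dim + 2):       # simplices with k vertices
        for combo in combinations(range(n), k):
            if all(math.dist(points[i], points[j]) <= 2 * r
                   for i, j in combinations(combo, 2)):
                simplices.append(combo)
    return simplices

# Three points: two close together, one far away.
X = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
cplx = rips_complex(X, r=0.6)  # 2r = 1.2, so only the first pair is joined
```

With these points, the complex contains the edge (0, 1) but no edge or triangle involving the distant point 2, matching β_0 = 2 at this scale.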
For Definitions 1 and 2, it is difficult to choose a proper r without prior domain knowledge.The main insight of persistent homology is to compute topological features of a space at different spatial resolutions.In general, the assumption is that features that persist for a wide range of parameters are "true" features.Features persisting for only a narrow range of parameters are presumed to be noise.
A filtration of a simplicial complex K is a collection of subcomplexes approximating the data points at different spatial resolutions, formally defined as follows:

Definition 3 (Filtration [11]). Let K be a simplicial complex and T ⊆ R. A family of subcomplexes (K_t)_{t∈T} of K is said to be a filtration of K if it satisfies (1) K_s ⊆ K_t for s ≤ t; (2) ∪_{t∈T} K_t = K.

Given ε ≥ 0, two filtrations (V_t)_{t∈T} and (W_t)_{t∈T} of E = R^d are ε-interleaved [24] if, for every t ∈ T, V_t ⊆ W_{t+ε} and W_t ⊆ V_{t+ε}. The interleaving pseudo-distance between (V_t)_{t∈T} and (W_t)_{t∈T} is defined as the infimum of such ε.

Let X be a finite point set in X = R^d and t ∈ R. The family {C(X; t)}_{t≥0} forms the Čech filtration, and the family {R(X; t)}_{t≥0} forms the Rips filtration. Since the Čech complex is expensive to compute, the Rips filtration, which is less expensive, is frequently used instead to investigate the topology of the point set X.
In the construction of Čech filtrations, the radii of the balls increase uniformly. We can also make the radii increase non-uniformly. Let f : X → R be a weight function, and for x ∈ X and t ∈ R, let B_f(x; t) be the ball of radius r_x(t) centered at x, where B_f(x; t) = ∅ for t < f(x) and r_x(t) = (t^p − f(x)^p)^{1/p} for t ≥ f(x). By modifying the definition of C(X; t), we define a simplicial complex C_f(X; t) by {x_0, . . . , x_k} ∈ C_f(X; t) :⇔ ∩_{i=0}^{k} B_f(x_i; t) ≠ ∅. For each fixed t, C_f(X; t) is a Čech complex. The family {C_f(X; t)}_t forms a filtration, which is called the weighted Čech filtration. The weighted Rips filtration can be constructed in a similar way.
For a filtration and each non-negative integer k, we keep track of when k-dimensional homological features appear and disappear in the filtration. If a homological feature α_i appears at b_i and disappears at d_i, we say that α_i is born at b_i and dies at d_i. By considering these pairs (b_i, d_i) as points in the plane, we obtain the persistence diagram.

Definition 4 (Bottleneck distance [15]). Given two persistence diagrams D and D', their bottleneck distance is defined by d_b(D, D') = inf_{γ∈Γ} sup_{p∈D̄} ‖p − γ(p)‖_∞, where ‖·‖_∞ is the usual L^∞-norm, D̄ = D ∪ Diag and D̄' = D' ∪ Diag with Diag being the diagonal {(x, x) : x ∈ R} ⊂ R^2 with infinite multiplicity, and the set Γ consists of all bijections γ : D̄ → D̄'.
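For homology in dimension 0, the persistence diagram of a Rips filtration can be computed with a union-find pass over the edges in increasing length order (Kruskal's minimum-spanning-tree algorithm): every point is born at 0, and a component dies when it merges into another. A small sketch, using the edge-length convention for death times (halve them to match the ball-radius parameter r of Definition 2):

```python
import math
from itertools import combinations

def zero_dim_persistence(points):
    """Finite 0-dimensional persistence pairs of the Rips filtration.
    Each point is born at 0; one component per merge dies at the
    length of the merging edge (MST edge lengths)."""
    n = len(points)
    parent = list(range(n))

    def find(i):                      # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                  # two components merge at scale d
            parent[ri] = rj
            deaths.append(d)
    # n-1 finite pairs; the last surviving component never dies.
    return [(0.0, d) for d in deaths]

# Two tight clusters: two early deaths, then one late merge at distance 4.9.
X = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0)]
pd0 = zero_dim_persistence(X)
```

The long-lived pair (0, 4.9) records that two components persist over a wide range of scales, exactly the kind of feature persistent homology treats as "true" structure.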

DTM Function
Despite their strong stability properties, distance-based methods in TDA, such as the Čech or Vietoris-Rips filtrations, are sensitive to outliers and noise. To address this issue, [24] introduced an alternative distance function, the DTM function. DTM-based filtrations are studied in detail in [25]. We only list the properties of the DTM that are used here.
Let µ be a probability measure on R^d and m ∈ [0, 1) a parameter. For every x ∈ R^d, let δ_{µ,m} be the function defined on R^d by δ_{µ,m}(x) = inf{r ≥ 0 : µ(B(x, r)) > m}.
Definition 5 (Distance-to-measure [24]). The distance-to-measure (DTM) function with parameter m ∈ [0, 1) and power p ≥ 1 is the function d_{µ,m,p} : R^d → R defined by d_{µ,m,p}(x) = ((1/m) ∫_0^m δ_{µ,t}(x)^p dt)^{1/p} (7); if not specified, p = 2 is used as a default and omitted.
In practice, the measure µ is usually unknown, and we only have a finite set of samples X = {x_1, . . . , x_n}. A natural way to estimate the DTM from X is to plug the empirical measure µ_n into Definition 5 in place of µ, obtaining the "distance to the empirical measure" (DTEM). For m = k/n, the DTEM satisfies d_{µ_n,k/n,p}(x)^p = (1/k) Σ_{j=1}^{k} ‖x − x^{(j)}‖^p (9), where ‖x − x^{(j)}‖ denotes the distance between x and its j-th nearest neighbor in {x_1, . . . , x_n}. This quantity is easy to compute in practice, since it only requires the distances between x and the sample points.
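The DTEM identity above reduces to a k-nearest-neighbor computation. A minimal sketch (function name ours):

```python
import math

def dtem(x, samples, k, p=2):
    """Distance to the empirical measure with m = k/n: the p-mean of the
    distances from x to its k nearest sample points."""
    dists = sorted(math.dist(x, s) for s in samples)
    return (sum(d ** p for d in dists[:k]) / k) ** (1.0 / p)

X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
# With k = 2, the outlier at (10, 10) does not affect the value near the cluster,
# illustrating the robustness of the DTM to noise points.
val = dtem((0.0, 0.0), X, k=2)
```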

Topological Regularization
Let X be the input space, Y the label space and Z the internal representation space before the classifier. Assuming there are C classes, we formulate the neural network as a compositional mapping η ∘ ϕ : X → Y = [C] = {1, . . . , C}, where ϕ : X → Z is a feature extractor and η : Z → Y is a classifier that maps the internal representation to the predicted label. Assume the representation space Z is equipped with a metric d. Let P be the probability measure on X and Q the push-forward probability measure induced by ϕ : X → Z on the Borel σ-algebra Σ defined by d on Z.
We focus on the internal representation space; in particular, we study the push-forward probability measure Q induced by the feature extractor ϕ on Z, identify a property of Q that is beneficial for generalization, and propose a regularization method to enforce this property.

Push-Forward Probability Measure and Generalization
Let c : supp(P) → Y denote the deterministic mapping from the support of P to the label space, and let S = {(x_1, y_1), . . . , (x_m, y_m)} be a training sample, where {x_1, . . . , x_m} are m i.i.d. draws from X ∼ P and y_i = c(x_i).
For a neural network h : X → Y and X ∼ P, we define the generalization error by E(h) = P(h(X) ≠ c(X)). To study the properties of Q, we consider class-specific probability measures as in [22]: the restriction of Q (i.e., the push-forward of P via ϕ) to class k is defined by Q_k(B) = P(ϕ^{-1}(B) ∩ c^{-1}(k)) / P(c^{-1}(k)) for Borel sets B ⊆ Z, where P(c^{-1}(k)) > 0. If the probability mass of class k's decision region, measured via Q_k, tends towards one, this may lead to better generalization. Reference [22] formalized this notion by establishing a direct link between Q_k and the generalization error.

Proposition 3 ([22]). For any class k ∈ [C] with decision region D_k = η^{-1}(k) ⊆ Z, the generalization error satisfies E(η ∘ ϕ) = Σ_{k∈[C]} P(c^{-1}(k)) (1 − Q_k(D_k)).
Proposition 3 links generalization to a condition depending on Q. Intuitively, increasing the probability of ϕ mapping a sample of class k into the correct decision region can improve generalization.
Based on this observation, [22] introduced the notion of a β-connected set to characterize connectivity via 0-dimensional (Vietoris-Rips) persistent homology and proved that the corresponding property of the probability measure Q_k is beneficial for generalization.

Definition 6 (β-connected [22]). Let β > 0. A set M ⊆ Z is β-connected iff all 0-dimensional death times of its Vietoris-Rips persistent homology lie in the open interval (0, β).
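Given the finite 0-dimensional death times of a point set, checking β-connectedness per Definition 6 is a one-line predicate:

```python
def is_beta_connected(death_times, beta):
    """Definition 6: a set is beta-connected iff every finite 0-dimensional
    death time of its Vietoris-Rips persistent homology lies in (0, beta)."""
    return all(0.0 < d < beta for d in death_times)

# A tight cluster: all components merge early.
assert is_beta_connected([0.1, 0.15, 0.2], beta=0.5)
# Two distant clusters: one late merge violates beta-connectedness.
assert not is_beta_connected([0.1, 0.15, 3.0], beta=0.5)
```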
However, [22] assumed a large margin in the representation space, which may be violated in deep neural networks. In the following, we extend their work and identify a property of the probability measure Q that can enhance the separation between classes.
Note that Proposition 3 can also be written in an alternate form: because the probability mass of all classes' decision regions, measured via Q_k, sums to one, we have Σ_{j≠k} Q_k(D_j) = 1 − Q_k(D_k), which means that for each class k, driving Q_k(D_k) towards one is equivalent to driving the total mass that Q_k assigns to the other classes' decision regions towards zero. Intuitively, decreasing the probability of ϕ mapping a sample of class k into an incorrect decision region can improve generalization.
Therefore, in order to decrease Q_j(D_i), we take class pairs (Q_i, Q_j) into consideration and formulate a notion of separation in terms of persistent homology as follows.

Probability Mass Separation
In this section, we show that a certain topological constraint on the pair (Q_i, Q_j) leads to probability mass separation. More precisely, we work with a reference set M in the representation space (e.g., a ball B(x_0; r_0)) and its l·β extension M_{l·β}, consisting of all points within distance l·β of M. In order to enforce the separation between two classes, we extend Definition 6 to characterize the separation between two sets. For a set M, we denote the death times of M's 0-dimensional Vietoris-Rips persistent homology by {d_i}_{i∈I_0} and order the indexing by decreasing lifetimes, i.e., d_i ≥ d_j for i < j.

Definition 7 ((β, γ)-separated). Two sets M_1 and M_2 are (β, γ)-separated iff the following two conditions are satisfied: (1) M_1 and M_2 are both β-connected; (2) the components corresponding to M_1 and M_2 in the 0-dimensional Vietoris-Rips persistent homology of M_1 ∪ M_2 do not merge before scale γ.

We then use this notion to capture the concentration and separation of the push-forward measure. For two classes C_1 and C_2, consider the restrictions of Q to C_1 and C_2, i.e., Q_1 and Q_2, and let p = Q_1(M), q = Q_1(M_{l·β}) and s = Q_2(M_{l·β}). As in [22], when p is fixed, we can lower bound q. In the following, we provide an approach to upper bound s, which yields mass separation between different classes. Consider the distribution of the mini-batch representations z_{i,j} among M_{l·β} and its complement N, and let n_1 and n_2 be the numbers of z_{1,i}'s and z_{2,i}'s, respectively, that fall within M_{l·β}. Thus, we define the event E that the samples (z_{1,1}, . . . , z_{1,b}, z_{2,1}, . . . , z_{2,b}) form a (β, γ)-separated arrangement.
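A sketch of the resulting separation check, under our illustrative reading of condition (2) as requiring that the two components do not merge before scale γ (the function and argument names are ours):

```python
def is_beta_gamma_separated(deaths_m1, deaths_m2, merge_scale, beta, gamma):
    """(beta, gamma)-separation sketch: (1) each set is beta-connected
    (all its finite 0-dim death times lie in (0, beta)); (2) the scale at
    which the two components of M1 union M2 merge is at least gamma."""
    beta_connected = (all(0.0 < d < beta for d in deaths_m1)
                      and all(0.0 < d < beta for d in deaths_m2))
    return beta_connected and merge_scale >= gamma

# Two tight, well-separated clusters pass the check...
assert is_beta_gamma_separated([0.1], [0.2], merge_scale=1.0,
                               beta=0.5, gamma=0.8)
# ...a loose cluster (late internal merge) fails condition (1).
assert not is_beta_gamma_separated([0.6], [0.2], merge_scale=1.0,
                                   beta=0.5, gamma=0.8)
```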
In the following lemma, we compute the probability of event E and derive some useful properties.
The probability of E can then be expressed in terms of q and s as a function Φ(q, s; b).

Proof.
For argument (1), we fix q = q_0. To study the monotonicity of Φ(q_0, ·; b), it suffices to consider the factor A(s).
We define two auxiliary functions and show that each is monotonically increasing in s. Consequently, A(s) is monotonically increasing, and thus so is Φ(q_0, ·; b). For argument (2), the proof is similar and omitted. Now we can derive the main theorem.

Proof. The left-hand side includes all the events that violate the separation assumption, while the right-hand side covers only a special case among them. Therefore, by combining Definition 8 and Proposition 4, we complete the proof.

Ramifications of Theorem 1
According to Theorem 1 and Proposition 4, if M_{l·β} covers a certain mass of Q_1, we can upper bound the mass it covers of Q_2: because Φ(q, s; b) is bounded by 1 − c_{β,γ} and is monotonically increasing in both q and s, s = Q_2(M_{l·β}) must be less than some s_0. This is beneficial for generalization if M_{l·β} is constructed from the representations of correctly classified training instances so as to include some minimal mass of C_1.
Assume that the mass of the reference set, p = Q_1(M), is fixed, and note that our Definition 7 is stronger than the mass concentration condition in [22]. Let R_{b,c_{β,γ}}(p, l) = min A_1 denote the smallest mass in the l·β extension; then q ≥ R_{b,c_{β,γ}}(p, l) ≥ p.
By Theorem 1, A_2 is non-empty. Now let T_{b,c_{β,γ}}(q) = max A_2 denote the largest mass in the l·β extension for which the inequality holds; as q increases, T_{b,c_{β,γ}}(q) decreases. Since Ψ is monotonically decreasing in q, R_{b,c_{β,γ}}(p) is monotonically increasing in c_{β,γ}; furthermore, since Φ is monotonically increasing in s, T_{b,c_{β,γ}}(q) is monotonically decreasing in c_{β,γ}. These facts motivate our regularization goal of increasing c_{β,γ}: increasing c_{β,γ} both boosts mass concentration within a class and enforces mass separation between two classes.
Suppose M is constructed mainly from training samples of C_1, i.e., we choose x_0 and r_0 to include many training samples from C_1. We can then ask the following: how much mass of C_1 must M contain, at minimum, to induce the separation?
We plot Φ(q, q; b) in Figure 1, where we can see that q = s at the point (0.049, 0.049), i.e., Φ(q, q; b) = 1 − c_{β,γ}. At this point, the mass separation effect starts to occur. To satisfy Inequality (21), when q increases, s should decrease, which means that as M_{l·β} covers more mass of Q_1, it covers less mass of Q_2. In addition, as the batch size b increases, the minimum mass of Q_1 that M_{l·β} should cover decreases.
In Figure 2, we fix the batch size to 8 and visualize Φ(q, s; b) as a function of s for different values of q. Points at which 1 − c_{β,γ} = Φ(q, s; b) holds are marked by dots. When q increases, T_{b,c_{β,γ}}(q) moves towards zero, which indicates a smaller s, i.e., M_{l·β} covers less mass of Q_2, and therefore a better separation is achieved.
In Figure 3, we plot T_{b,c_{β,γ}}(q) as a function of q for different values of c_{β,γ}, where q = Q_1(M_{l·β}). As c_{β,γ} increases, the maximal mass of Q_2 contained in M_{l·β}, characterized by T_{b,c_{β,γ}}(q), shifts towards a smaller value, indicating that a better separation between classes C_1 and C_2 is achieved. Moreover, in order to achieve separation, M only needs to cover a small mass of Q_1, and for a fixed p, as the batch size b increases, the maximal mass of Q_2 contained in M_{l·β} decreases.


Weighted Rips Filtration and Regularization
In Section 3.2, we showed that a topological constraint on a (Q i , Q j ) pair leads to probability mass concentration and separation. To impose this constraint, we propose a weight function used to construct the filtration; we then compute the 0-dimensional persistence diagram and construct loss terms to regularize the internal representation.
Our method acts on the level of mini-batches: we construct each mini-batch B as a collection of n sub-batches, i.e., B = (B_1, . . . , B_n), as in [22]. Each sub-batch consists of b samples from the same class, and thus the resulting mini-batch B contains n · b samples. Our regularizer consists of three terms and penalizes deviations from a (β, γ)-separated arrangement of (z_{1,1}, . . . , z_{1,b}, z_{2,1}, . . . , z_{2,b}) for all sub-batch pairs (B_i, B_j).
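The mini-batch construction can be sketched as follows; `dataset_by_class` and `make_minibatch` are illustrative names of our own:

```python
import random

def make_minibatch(dataset_by_class, n, b, rng=None):
    """Build a mini-batch B = (B1, ..., Bn) of n sub-batches, each holding
    b samples drawn from a single class, as described above."""
    rng = rng or random.Random(0)
    classes = rng.sample(sorted(dataset_by_class), n)       # pick n classes
    return [rng.sample(dataset_by_class[c], b) for c in classes]

# Toy dataset: three classes with 100 samples each.
data = {0: list(range(0, 100)), 1: list(range(100, 200)),
        2: list(range(200, 300))}
batch = make_minibatch(data, n=2, b=4)   # 2 sub-batches, 4 samples apiece
```

The regularizer then runs over all sub-batch pairs (B_i, B_j), so each gradient step sees several class pairs.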

A Weight Function for Weighted Rips Filtration
To construct a proper filtration for samples from two different classes, we define a weight function f_{Q_i,Q_j,m,T}, built from the DTM values of the two class-restricted measures (Equation (24)), where T is a temperature that controls the magnitude and d_{µ,m}(·) is the DTM function defined in Equation (7).
Considering the mass separation of two classes, we denote the data instances of class k by S_k; then, for a class pair (C_i, C_j), the training samples can be written as S_{i,j} = S_i ∪ S_j. Let Q_i and Q_j be the restrictions of Q (i.e., the push-forward of P via ϕ) to classes i and j, respectively. In order to construct the filtration with Equation (24), we first need to compute f_{Q_i,Q_j,m,T}(x) for x ∈ S_{i,j}. Note that d_{Q_i,m} and d_{Q_j,m} can be computed with Equation (9), the DTEM, where Q_i is approximated by ϕ(S_i) and Q_j is approximated by ϕ(S_j). According to Equation (24), for a good classifier, points from class i should have smaller function values than points from class j. We then plug f_{Q_i,Q_j,m,T}(x) into Equation (4) to compute the weighted Rips filtration (we set p = 1 in Equation (4) in this work) and obtain the 0-dimensional persistence diagram, i.e., the multi-set of intervals for homology in dimension 0, PD_0(X_{i,j}) = {(b_k, d_k)}_{k∈I_0}. After that, we order the indexing by decreasing lifetimes, as in Definition 7; these intervals are used to construct the loss terms in Section 3.4.3.
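Since Equation (24) is not reproduced above, the sketch below uses a hypothetical combination of the two class-wise DTEM values (a sigmoid of their difference, scaled by the temperature T). It satisfies the stated requirement that points near class i's mass receive smaller values than points near class j's mass, but it is a stand-in, not the paper's exact formula:

```python
import math

def dtem(x, samples, k, p=2):
    """DTEM: p-mean of distances to the k nearest sample points."""
    dists = sorted(math.dist(x, s) for s in samples)
    return (sum(d ** p for d in dists[:k]) / k) ** (1.0 / p)

def weight(x, phi_Si, phi_Sj, k, T):
    """Hypothetical stand-in for Equation (24): map the difference of the
    two class-wise DTEM values through a sigmoid so values lie in (0, 1);
    T controls how sharply the weight switches between the classes."""
    di = dtem(x, phi_Si, k)   # DTEM w.r.t. the empirical measure of class i
    dj = dtem(x, phi_Sj, k)   # DTEM w.r.t. the empirical measure of class j
    return 1.0 / (1.0 + math.exp(-(di - dj) / T))

phi_Si = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]   # representations of S_i
phi_Sj = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]   # representations of S_j
w_i = weight((0.0, 0.05), phi_Si, phi_Sj, k=2, T=1.0)  # near class i: small
w_j = weight((5.0, 5.05), phi_Si, phi_Sj, k=2, T=1.0)  # near class j: large
```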

Stability
In this section, we establish stability results for our weight function in Equation (24). Theorem 2 gives the stability of the weight function, which is later used in Theorem 3 to ensure the stability of the filtration V[X, f] with respect to the weight function f. Proposition 5 ensures the stability of the filtration V[X, f] with respect to X. By persistent homology theory, stability results for the filtration translate into stability results for the persistence diagrams. We present our main stability result in Theorem 3.
In Proposition 5, we consider the stability of the filtration with respect to X.For brevity, the subscripts of f are omitted.

Proof. It suffices to show that, for every t, V_t[X, f] ⊆ V_{t+k}[Y, f] and, by symmetry, V_t[Y, f] ⊆ V_{t+k}[X, f].
For z ∈ V_t[X, f], there exists x ∈ X such that z ∈ B_f(x, t), i.e., ‖x − z‖ ≤ r_x(t). From the hypothesis d_H(X, Y) ≤ ε, there exists y ∈ Y such that ‖x − y‖ ≤ ε. It remains to prove that z ∈ B_f(y, t + k), i.e., ‖z − y‖ ≤ r_y(t + k). By the triangle inequality, ‖z − y‖ ≤ ‖z − x‖ + ‖x − y‖ ≤ r_x(t) + ε, so it suffices to show that r_x(t) + ε ≤ r_y(t + k).
Using Equation (4) and the fact that, by Proposition 1, the DTM function is 1-Lipschitz, we obtain r_x(t) + ε ≤ r_y(t + k), and therefore z ∈ B_f(y, t + k). In the following theorem, we combine the above results to establish the stability of the persistence diagram with respect to X and f.

Theorem 3. Consider four measures µ_1, µ_2, ν_1 and ν_2 on R^d with compact supports X_1, X_2, Y_1 and Y_2, respectively, and let the corresponding weight functions and weighted Rips filtrations be constructed as above. Then the bottleneck distance between the resulting persistence diagrams is bounded in terms of the Hausdorff distance between the supports and the distance between the weight functions.

Proof. Under some regularity conditions, the stability of persistence diagrams reduces the claim to bounding d_i, the interleaving pseudo-distance between the two filtrations defined in Equation (3).
We use the triangle inequality for the interleaving distance. The first part (1) on the right-hand side of Equation (30) is bounded by Proposition 5. For the second part (2) on the right-hand side of Equation (30), according to Proposition 3.2 in [25] together with Theorem 2, we obtain the corresponding bound. Combining parts (1) and (2) completes the proof.

Regularization via Persistent Homology
We split the persistence intervals obtained in Section 3.4.1 into two subsets, {(b_{i,k}, d_{i,k})}_{k∈I_{0,i}} and {(b_{j,k}, d_{j,k})}_{k∈I_{0,j}}, where the former consists of the intervals whose birth belongs to class i and the latter consists of the intervals whose birth belongs to class j. We now define three loss terms:

• Birth loss

The birth loss is designed to measure intra-class distance, in order to meet the first requirement of Definition 6 and enforce intra-class mass concentration; b_0 and b_1 are hyperparameters used to control the birth times for each class.

• Margin loss
The margin loss is designed to measure the "distance" between the two classes. There may be connected components that appear due to points from class i but disappear due to points from class j; these cases should be penalized. In addition, for class j, the longest interval in {(b_{j,k}, d_{j,k})}_{k∈I_{0,j}} will eventually merge into class i's intervals.
Let b_{j,min} = min{b_{j,k} : k ∈ I_{0,j}}; we define the margin loss as Σ_{k∈I_{0,i}} max(0, γ − (b_{j,min} − d_{i,k})), which means that we penalize margins b_{j,min} − d_{i,k} smaller than γ, where γ is a hyperparameter that controls inter-class separation.
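A minimal sketch of the margin loss, assuming the hinge form implied by the sentence above (penalize each margin b_{j,min} − d_{i,k} that falls short of γ); the function name is ours:

```python
def margin_loss(deaths_i, births_j, gamma):
    """Hinge-style margin loss sketch: for every class-i death time d,
    penalize the shortfall of the margin b_j_min - d below gamma."""
    b_j_min = min(births_j)
    return sum(max(0.0, gamma - (b_j_min - d)) for d in deaths_i)

# Class-i intervals die well before class-j intervals are born: no penalty.
no_penalty = margin_loss([0.1, 0.2], [1.0, 1.2], gamma=0.5)
# Margins 1.0-0.8 = 0.2 and 1.0-0.9 = 0.1 fall below gamma = 0.5.
loss = margin_loss([0.8, 0.9], [1.0, 1.2], gamma=0.5)
```

In a training setup, `deaths_i` and `births_j` would come from the two interval subsets defined above, and the hinge makes the loss zero once the desired γ-margin is reached.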

• Length loss
The weighted Rips filtration is not as direct as the Rips filtration in controlling distances. Therefore, the length loss can be used in combination with the birth loss to penalize large intra-class distances. In addition, we want the two classes to correspond to two connected components that persist over a wide range of parameters until they finally merge when the parameter reaches a sufficiently large value, and we also want to prevent Q_k from becoming overly dense. We formulate this intuition as a length loss with hyperparameter β. Finally, our regularization term (Equation (35)) is the weighted sum of the birth, margin and length losses, where the weightings λ_1, λ_2 and λ_3 can be set such that the loss is comparable in range to the cross-entropy loss, or can be selected via cross-validation.

Experiments
In this section, we test our idea with some experiments. We first consider point cloud optimization to obtain some intuition on the behavior of Equation (35), and then we evaluate our approach on the image classification task.

Point Cloud Optimization
To validate our approach, as an illustrative example, we perform point cloud optimization with only the proposed loss in Equation (35), without other loss items. Point clouds are generated from a Gaussian mixture distribution, and we assume that these points are from two different classes: C 1 and C 2 . In the following figures, purple points represent samples from C 1 , and red points represent samples from C 2 .
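The point clouds used below can be generated as follows; `sample_two_class_mixture` is a hypothetical helper, with identity covariance as described in the text:

```python
import numpy as np

def sample_two_class_mixture(n_per_class, centers_c1, centers_c2, seed=0):
    """Draw a two-class 2D point cloud from a Gaussian mixture.

    Each entry of `centers_c1` / `centers_c2` is the mean of one component
    with identity covariance.
    """
    rng = np.random.default_rng(seed)

    def draw(centers):
        pts = [rng.normal(loc=c, scale=1.0, size=(n_per_class, 2)) for c in centers]
        return np.concatenate(pts, axis=0)

    return draw(centers_c1), draw(centers_c2)
```

For the two-component experiment below, one would call it with centers (−0.6, 0) and (0.6, 0).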

Gaussian Mixture with Two Components
To test the separation effect, we set the centers of the two components to (−0.6, 0) and (0.6, 0), and the covariance matrix is set to the identity matrix. For parameters in the weight function of Equation (24), we set m = 0.1 and 1/T = 0.05. For hyperparameters in the three loss items, we set b 0 = 0, b 1 = 1.0, β = 0 and γ = 0.6. To encourage clustering and speed up optimization, we adopt a dynamic m update scheme, i.e., we gradually increase m during training. The weightings of the three loss items are set to 9, 1 and 0.3, respectively, chosen using gradient information in the first epoch.
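The dynamic m update scheme can be realized with, for example, a simple linear schedule; the linear form and the final value `m_end` are our assumptions, as the paper does not specify them:

```python
def m_schedule(epoch, total_epochs, m_start=0.1, m_end=0.5):
    # Linearly increase the weight-function parameter m during training.
    # m_start matches the initial value used in this experiment; m_end is
    # a hypothetical final value.
    frac = min(max(epoch / total_epochs, 0.0), 1.0)
    return m_start + (m_end - m_start) * frac
```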
Figure 6a shows the initial position of the points; the purple points are sampled from C 1 , and the red points are sampled from C 2 . Figure 6b shows the final position after 5000 epochs; the mass concentration and separation effects are obvious, and the points from the two classes are well separated.
Since the weight function f (x) may lead to imbalanced point configurations for the two classes, we can address this issue by alternating the order of the two sets of points when feeding data to the computation of persistent homology during training. For vision datasets, because our regularization is used together with the cross-entropy loss and a stochastic mini-batch sampling scheme, the imbalance is compensated automatically.
Figure 7 compares the persistence diagrams before and after training; each green point represents a (b i , d i ) pair, the green point with y = inf represents the final merged single component left when the filtration value is sufficiently large, and the green point with y > 6 tells us that at this value the two components merge, i.e., one component disappears and merges into the other component, which is generated at an earlier time. In Figure 7, we can see that after 5000 epochs, the two subsets mentioned in Section 3.4.3 that correspond to the two classes are well separated, i.e., two connected components can be identified in the persistence diagram. We can also see that the points from the same class are concentrated.
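For intuition, the 0-dimensional persistence diagram of a plain (unweighted) Vietoris-Rips filtration can be computed from the minimum spanning tree: every point is born at filtration value 0, and a component dies when the merging edge enters the filtration. The weighted Rips filtration used in our method additionally shifts birth times by the weight function, which this sketch omits:

```python
import math
from itertools import combinations

def zero_dim_persistence(points):
    """0-dimensional persistence of a plain Vietoris-Rips filtration.

    Finite death times are the minimum-spanning-tree edge lengths
    (Kruskal's algorithm with union-find); one component survives forever.
    """
    n = len(points)
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    edges = sorted(
        (math.dist(points[u], points[v]), u, v)
        for u, v in combinations(range(n), 2)
    )
    diagram = []
    for w, u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            diagram.append((0.0, w))  # a component dies at edge length w
    diagram.append((0.0, math.inf))  # the last surviving component
    return diagram
```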

Gaussian Mixture with Four Components
As a more challenging example, we consider a Gaussian mixture with four components. We suppose that they represent samples from two different classes, i.e., each class corresponds to two components. For each class, we hope the corresponding two components can merge. To achieve this goal, samples from one class have to travel across samples from the other class, which may cause the loss to increase. Therefore, to obtain optimal results, the optimizer needs to climb the mountain in the loss landscape before it arrives at a valley.
Figure 8 visualizes the points before and after training. Figure 8a shows the initial position of the points. The purple points are sampled from C 1 , and the red points are sampled from C 2 . We can see that after 1800 epochs, for each class, the two components merge into a single connected component, as shown in Figure 8b.
Figure 8 visualizes the points before and after training.Figure 8a shows the initial position of the points.The purple points are sampled from C1, and the red points are sampled from C2.We can see that after 1800 epochs, for each class, the two components merge into a single connected component, as shown in Figure 8b. Figure 9 visualizes the function values calculated by our weight function (24); these values are used to construct the weighted Rips filtration using Equation (4) to extract topological information, and finally, the topological information is used by the regularizer to guide the optimization.Figure 9a visualizes the function values and the contour lines at epoch 0; it can be seen that larger values are assigned for points from C 2 .Figure 9b visualizes the function values and the contour lines after 1800 epochs.Figure 9 visualizes the function values calculated by our weight function (24); these values are used to construct the weighted Rips filtration using Equation (4) to extract topological information, and finally, the topological information is used by the regularizer to guide the optimization.Figure 9a visualizes the function values and the contour lines at epoch 0; it can be seen that larger values are assigned for points from C2. Figure 9b  Similar to Figure 7, Figure 10 compares the persistence diagrams before and after training.Figure 10b shows that after 1800 epochs, two connected components can be identified, and the mass concentration and separation effect is obvious.Similar to Figure 7, Figure 10 compares the persistence diagrams before and after training.Figure 10b shows that after 1800 epochs, two connected components can be identified, and the mass concentration and separation effect is obvious.Similar to Figure 7, Figure 10 compares the persistence diagrams before and after training.Figure 10b shows that after 1800 epochs, two connected components can be identified, and the mass concentration and separation effect is obvious.

Gaussian Mixture with Nine Components
In Figures 11 and 12, we present the results for a Gaussian mixture with nine components. Figure 11a shows the initial position of the points sampled from two classes, while Figure 11b shows the results after 12,500 epochs. Figure 12 compares the persistence diagrams before and after training; it can be seen that our method achieves an effective performance in separating samples from two different classes.

Datasets
In this part, we use the same models and settings as [22], and we evaluate our method on three vision benchmark datasets: MNIST [26], SVHN [27] and CIFAR10 [28]. For MNIST and SVHN, 250 instances are used for training the model; for CIFAR10, 500 and 1000 instances are used. The CNN-13 [29] architecture is employed for CIFAR10 and SVHN. For MNIST, a simpler CNN architecture is employed. We use a stochastic gradient descent (SGD) optimizer with a momentum of 0.9, and the cosine annealing learning rate scheduler [30] is employed.
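Drawing a small training subset of this kind (e.g., 250 MNIST instances) can be sketched as follows; the per-class balancing is our assumption:

```python
import numpy as np

def subsample_balanced(labels, n_total, n_classes=10, seed=0):
    """Pick n_total training indices, balanced across classes."""
    rng = np.random.default_rng(seed)
    per_class = n_total // n_classes
    picked = []
    for c in range(n_classes):
        idx = np.flatnonzero(labels == c)
        picked.append(rng.choice(idx, size=per_class, replace=False))
    return np.concatenate(picked)
```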
With the cross-entropy loss, the weighting of our regularization term is set such that the loss in Equation (35) is comparable to the cross-entropy loss. In our experiments, each batch contains n = 8 sub-batches, and the sub-batch size is set to b = 16; thus, the total batch size is 128.
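The batch composition can be sketched as follows; how samples are assigned to sub-batches is not specified in the text, so a simple contiguous split is used for illustration:

```python
def compose_batch(indices, n_sub=8, sub_size=16):
    """Split a batch of n_sub * sub_size indices into sub-batches.

    Mirrors the batch composition described above: 8 sub-batches of 16
    give a total batch size of 128.
    """
    assert len(indices) == n_sub * sub_size
    return [indices[i * sub_size:(i + 1) * sub_size] for i in range(n_sub)]
```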
During training, for each epoch, we select the 10 most significant channels dynamically for each class to perform the topological computation; the criterion for channel selection is similar to that in [31]. To compensate for the imbalance between the two classes induced by the weight function, we use the ratio of the derivatives to weight the two items in the birth loss (Equation (32)). In order to meet the stability requirements of the topological computation, we use 0.001 as the minimal differentiable distance between points. The weighting of our regularization term is set to 0.001. For parameters in the weight function (24), we set m = 0.2 and 1/T = 0.15. For hyperparameters in the three loss items, we set b 0 = 0.1, b 1 = 2.5, β = 0.3 and γ = 1.8; weight decay on ϕ is fixed to 1 × 10 −3 , and weight decay on η is fixed to 1 × 10 −3 , except for CIFAR10-1k, for which we set it to 5 × 10 −4 . On MNIST, the initial learning rate is fixed to 0.1; on SVHN and CIFAR10, it is fixed to 0.5.
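The cosine annealing scheduler [30] referenced above follows the standard formula; a minimal sketch (a minimum learning rate of 0 is assumed, as it is not reported):

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr0, lr_min=0.0):
    # Standard cosine annealing: decay from lr0 to lr_min over total_epochs.
    return lr_min + 0.5 * (lr0 - lr_min) * (1.0 + math.cos(math.pi * epoch / total_epochs))
```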
Table 1 compares our method to Vanilla (including batch normalization, dropout and weight decay) and the regularizers proposed in relevant works, in particular, the regularizers based on statistics of representations [1,2] and the topological regularizer proposed in [22]. In addition, we also provide the results given by the Jacobian regularizer [9]. We report the average test error (%) and the standard deviation over 10 cross-validation runs. The number attached to the dataset names indicates the number of training instances used. It can be seen that our method achieves the lowest average error for MNIST-250, CIFAR10-500 and CIFAR10-1k. For SVHN, the mean error is slightly higher than the result presented in [22], but our method achieves a lower variance. Notably, our method outperforms all the regularization methods based on statistical constraints by a significant margin, which demonstrates the advantage of the proposed topology-aware regularizer and supports our claim that mass separation is beneficial.

Conclusions
Traditionally, statistical methods are employed to impose constraints on the internal representation space of deep neural networks, while topological methods are generally underexploited. In this paper, we took a fundamentally different perspective to control internal representations with tools from TDA. By utilizing persistent homology, we constrained the push-forward probability measure and enhanced mass separation in the internal representation space. Specifically, we formulated a property of this measure that is beneficial for generalization for the first time, and we proved that a topological constraint in the representation space leads to mass separation. Moreover, we proposed a novel weight function for the weighted Rips filtration, proved its stability and introduced a regularizer that operates on the persistence diagram obtained via persistent homology to control the distribution of the internal representations.
We evaluated our approach on the point cloud optimization task and the image classification task. For the point cloud optimization task, experiments showed that our method can separate points from different classes effectively. For the image classification task, experiments showed that our method significantly outperformed previous relevant regularization methods, especially those based on statistical constraints.
In summary, both the theoretical analysis and the experimental results showed that our method can provide an effective learning signal that utilizes topological information to guide internal representation learning. Our work demonstrated that persistent homology may serve as a novel and powerful tool for promoting topological structure in the internal representation space. Areas for future research include the exploration of the potential of 1-dimensional persistent homology and the development of other topology-aware methods for deep neural networks.


Figure 1 .
Figure 1. Illustration of when Φ(q, q; b) = 1 − c β,γ holds, i.e., when the mass separation effect starts to occur. When q increases, s should decrease, which means that as M l•β covers more mass of Q 1 , it covers less mass of Q 2 . As the batch size b increases, the least mass of Q 1 that M l•β should cover decreases.

Figure 1 . covers more 1 Q
Figure 1.Illustration of when indicates a smaller s , i.e., l M   covers less mass of 2 Q , and ther leads to a better separation.


Figure 2 .
Figure 2. Illustration of Φ(q, s; b) for b = 8 and different values of q. Points at which 1 − c β,γ = Φ(q, s; b) holds are marked by dots. When q increases, T b,c β,γ (q) moves towards zero, which indicates a smaller s, i.e., M l•β covers less mass of Q 2 .

Figure 3 .
Figure 3. Illustration of T b,c β,γ (q), i.e., the upper bound on s = Q 2 (M l•β ), plotted as a function of the mass q = Q 1 (M l•β ) (for b = 8 and different values of c β,γ ). For a fixed q, as c β,γ is increased, the maximal mass of Q 2 contained in M l•β decreases, and better separation is achieved.

Figure 4 visualizes G b,c β,γ (p, l) as a function of p for different values of c β,γ , where p = Q 1 (M). It can be seen that as c β,γ is increased, the maximal mass of Q 2 contained in M l•β , characterized by G b,c β,γ (p, l), also shifts towards a smaller value, which indicates a better separation.


Figure 4 .
Figure 4. Illustration of G b,c β,γ (p, l), i.e., the upper bound on s = Q 2 (M l•β ), plotted as a function of the mass p = Q 1 (M) (for b = 8 and different values of c β,γ ). For a fixed p, as c β,γ is increased, the maximal mass of Q 2 contained in M l•β decreases.

Figure 5 plots G b,c β,γ (p, l) as a function of p for different values of the batch size b, where p = Q 1 (M). As b is increased, the maximal mass of Q 2 contained in M l•β , characterized by G b,c β,γ (p, l), also shifts towards a smaller value, which indicates a better separation, and in order to achieve separation, M only needs to cover a small mass of Q 1 .

Figure 5 .
Figure 5. Illustration of G b,c β,γ (p, l), i.e., the upper bound on s = Q 2 (M l•β ), plotted as a function of the mass p = Q 1 (M) (for c β,γ = 0.95 and different values of b). For a fixed p, as the batch size is increased, the maximal mass of Q 2 contained in M l•β decreases.

Proposition 5 .
Suppose that X and Y are compact and that the Hausdorff distance d H (X, Y) ≤ ε. Then the filtrations V[X, f ] and V[Y, f ] are k-interleaved with k = ε(1 + 2/T).

Figure 6 .
Figure 6. (a) The original points are sampled from a Gaussian mixture with two components; the purple points are sampled from class 1, and the red points are sampled from class 2. (b) Optimized configuration after 5000 epochs; the points from the two classes are well separated.

Figure 7 .
Figure 7. (a) Persistence diagram obtained via persistent homology at epoch 0; (b) Persistence diagram after 5000 epochs. Two connected components can be identified; the points with x ≈ 0.5 correspond to the first connected component (class 1), and the points with x ≈ 1.0 correspond to the second connected component (class 2).

Figure 8 .
Figure 8. (a) The original points are sampled from a Gaussian mixture with four components, and each class corresponds to two components; the purple points are sampled from class 1, and the red points are sampled from class 2. (b) Optimized configuration after 1800 epochs; the points from the two classes are well separated.


Figure 9 .
Figure 9. (a) Weight function values and contour lines evaluated on the mesh at epoch 0; the points from class 2 correspond to larger values. (b) Weight function values and contour lines evaluated on the mesh after 1800 epochs.

Figure 10 .
Figure 10. (a) Persistence diagram obtained via persistent homology at epoch 0. (b) Persistence diagram after 1800 epochs. Two connected components can be identified; the points with x ≈ 0.5 correspond to the first connected component (class 1), and the points with x ≈ 1.0 correspond to the second connected component (class 2).

Figure 11 .
Figure 11. (a) The original points are sampled from a Gaussian mixture with nine components; the purple points are sampled from class 1, and the red points are sampled from class 2. (b) Optimized configuration after 12,500 epochs; the points from the two classes are well separated.

Figure 12 .
(a) Persistence diagram obtained via persistent homology at epoch 0. (b) Persistence diagram after 12,500 epochs. Two connected components can be identified; the points with x ≈ 0.5 correspond to the first connected component (class 1), and the points with x ≈ 1.0 correspond to the second connected component (class 2).

Table 1 .
Comparison to previous regularizers. "Vanilla" includes batch normalization, dropout and weight decay. The average test error and the standard deviation are reported.