# Lumáwig: An Efficient Algorithm for Dimension Zero Bottleneck Distance Computation in Topological Data Analysis


## Abstract


## 1. Introduction

## 2. Bypassing Matchings

**Lemma 1.**

**Proof.**

**Lemma 2.**

1. If $\max(Z)\le \max(x_{l},y_{l})/2$, then $d_{B}(X,Y)=\max(Z)$.
2. If $\zeta <\max(x_{l},y_{l})/2<\max(Z)$, then $d_{B}(X,Y)=\max(x_{l},y_{l})/2$.
3. If $\zeta \ge \max(x_{l},y_{l})/2$ and $m\ge l$ for every $m$ such that $z_{m}\ge \max(x_{l},y_{l})/2$, then $d_{B}(X,Y)=\max(x_{l},y_{l})/2$.
4. If $\zeta \ge \max(x_{l},y_{l})/2$ and there exists $m<l$ such that $z_{m}\ge \max(x_{l},y_{l})/2$, then there exists a bijection $\tau$ between $X$ and $Y$ such that one of the three preceding cases holds and where $$\max \|x-\tau(x)\|_{\infty}<\max \|x-\varphi(x)\|_{\infty}.$$

**Proof.**

1. It follows from our remark immediately after (1) that $$\max \|x-\varphi(x)\|_{\infty}=\max(Z)\le \max(x_{l},y_{l})/2=\max \|x-\varphi'(x)\|_{\infty}.$$
2. In this case, the same bijection $\varphi'$ as in the previous case yields $$\max \|x-\varphi'(x)\|_{\infty}=\max(x_{l},y_{l})/2<\max(Z)=\max \|x-\varphi(x)\|_{\infty}.$$ The same argument as in the previous case holds for any other bijection $\psi$. Hence, the inequality above implies the conclusion.
3. For the bijection $\varphi''$ that sends $x_{m}$ and $y_{m}$ to the diagonal for all such $m$, and coincides with $\varphi$ otherwise (see Figure 2d), we have that $$\max \|x-\varphi''(x)\|_{\infty}=\max(x_{l},y_{l})/2<\max(Z)=\max \|x-\varphi(x)\|_{\infty}.$$ Again, since the same argument as in the first case holds for any other bijection $\psi$, the previous inequality implies the conclusion.
4. Define the bijection $\tau$ that sends $x_{j}$ and $y_{j}$ to the diagonal for all $j\ge l$, and coincides with $\varphi$ otherwise. Then we have that $$\max \|x-\tau(x)\|_{\infty}<\max(Z)=\max \|x-\varphi(x)\|_{\infty}.$$ Moreover, note that $\max \|x-\tau(x)\|_{\infty}$ depends only on $\|x-\tau(x)\|_{\infty}$ for non-trivially matched $x$ and $\tau(x)$. Therefore, we can consider only the subsets $X'$ and $Y'$ of $X$ and $Y$, respectively, whose points are non-trivially matched by $\tau$. In this case, $\mathrm{length}(X')=\mathrm{length}(Y')$ and one of the three previous cases holds. The proof is now complete. □
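
As an illustration of case 2 of Lemma 2, consider a small worked example. It assumes, following the notation of Algorithm 1, that $X$ and $Y$ list death times sorted from largest to smallest, that $z_{i}=|x_{i}-y_{i}|$, that $l=\arg\max(Z)$, and that $\zeta$ denotes the second largest entry of $Z$. These readings are inferred from the algorithm rather than stated in the lemma itself.

$$\begin{aligned}
X&=(10,\,1),\qquad Y=(10,\,5),\\
Z&=(0,\,4),\qquad l=2,\qquad \max(Z)=4,\qquad \zeta=0,\\
\max(x_{l},y_{l})/2&=\max(1,5)/2=2.5,\\
\zeta=0&<2.5<4=\max(Z)\ \Longrightarrow\ d_{B}(X,Y)=2.5.
\end{aligned}$$

The value can be checked directly: matching the two points with death time 10 to each other and sending the points with death times 1 and 5 to the diagonal costs $\max(0,\,0.5,\,2.5)=2.5$, whereas matching the point with death time 1 to the point with death time 5 would cost 4.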

**Algorithm 1.** Lumáwig algorithm for computing the 0-dimensional bottleneck distance between two persistence diagrams

1: Input: Two dimension zero persistence diagrams $X$ and $Y$ such that $X\ne Y$ and where $X$ has fewer than or as many points as $Y$.

2: Output: The bottleneck distance between $X$ and $Y$.

3: Initialization: $d\leftarrow 0$, $X\leftarrow$ death times of points from $X$ sorted from largest to smallest, $Y\leftarrow$ death times of points from $Y$ sorted from largest to smallest, $N=\mathrm{length}(X)$, $Z\leftarrow$ vector $[z_{i}:=|x_{i}-y_{i}|]_{i=1}^{N}$, $l=\arg\max(Z)$, $d_{temp}=\max(Z)$

4: if $\mathrm{length}(X)\ne \mathrm{length}(Y)$ and $d_{temp}<y_{N+1}/2$ then

5: $d=y_{N+1}/2$

6: else

7: while $\mathrm{length}(Z)>1$ do

8: if $\text{second largest entry of } Z<\max(x_{l},y_{l})/2<d_{temp}$ then

9: $d=\max(x_{l},y_{l})/2$

10: break

11: else if $\text{second largest entry of } Z\ge \max(x_{l},y_{l})/2$ then

12: if $m\ge l$ for every $m$ for which $z_{m}\ge \max(x_{l},y_{l})/2$ then

13: $d=\max(x_{l},y_{l})/2$

14: break

15: else

16: Trim off all $z_{m},x_{m},y_{m}$ for $m\ge l$; update $l$ and $d_{temp}$

17: if $\mathrm{length}(Z)=1$ then

18: $d=\min(d_{temp},\max(x_{l},y_{l})/2)$

19: break

20: end if

21: end if

22: else

23: $d=d_{temp}$

24: break

25: end if

26: while end

27: end if
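
The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal transcription for illustration, not the authors' released implementation. The function name `lumawig_sketch` is hypothetical, inputs are assumed to be plain lists of finite death times (all births at 0), and two conveniences are assumptions that go beyond the listing: the inputs are swapped if the first is longer, and a single remaining pair is always handled with the formula from line 18, including when $N=1$ at the start.

```python
def lumawig_sketch(x_deaths, y_deaths):
    """Illustrative transcription of Algorithm 1 (hypothetical helper).

    x_deaths, y_deaths: finite death times of dimension zero diagrams,
    with all births assumed to be 0.
    """
    x = sorted(x_deaths, reverse=True)  # largest to smallest
    y = sorted(y_deaths, reverse=True)
    if len(x) > len(y):                 # assumption: swap so x is the shorter one
        x, y = y, x
    n = len(x)
    if n == 0:
        raise ValueError("both diagrams need at least one point")

    z = [abs(a - b) for a, b in zip(x, y)]
    d_temp = max(z)
    l = z.index(d_temp)                 # 0-based arg max (first occurrence)

    # Lines 4-5: an unmatched point of Y dominates the paired differences.
    if len(y) != n and d_temp < y[n] / 2:
        return y[n] / 2

    # Lines 7-26: examine the largest difference, trimming when required.
    while len(z) > 1:
        half = max(x[l], y[l]) / 2
        second = sorted(z, reverse=True)[1]   # second largest entry of Z
        if second < half < d_temp:            # lines 8-10
            return half
        if second >= half:                    # line 11
            if all(m >= l for m, zm in enumerate(z) if zm >= half):
                return half                   # lines 12-14
            # Line 16: drop every index m >= l, then update l and d_temp.
            x, y, z = x[:l], y[:l], z[:l]
            d_temp = max(z)
            l = z.index(d_temp)
            continue
        return d_temp                         # lines 22-24

    # Lines 17-19 (and, by assumption, the case N = 1 at the start).
    return min(d_temp, max(x[l], y[l]) / 2)


if __name__ == "__main__":
    print(lumawig_sketch([10.0, 1.0], [10.0, 5.0]))  # case 2 of Lemma 2
    print(lumawig_sketch([10.0, 4.0], [7.0, 0.5]))   # exercises the trimming step
```

Re-sorting `z` in every iteration keeps the sketch close to the listing but is not tuned for speed; an implementation aimed at large diagrams would track the two largest entries more carefully.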

## 3. Benchmarking

#### 3.1. Benchmarking against All Available Algorithms

Lumáwig${}_{\mathsf{Py}}$ recovers that of Persim at a much lower computational running time. Finally, we highlight that Lumáwig${}_{\mathsf{R}}$ recovers the exact output values of the original implementation in Dionysus.

#### 3.2. Benchmarking Lumáwig on Larger Data Sets

Lumáwig${}_{\mathsf{Py}}$ with respect to that of Lumáwig${}_{\mathsf{R}}$. Consistent with the comparison between the outputs of Hera and Dionysus in Figure 4a, Hera consistently overestimates the dimension zero bottleneck distance with respect to that of Lumáwig${}_{\mathsf{R}}$. In contrast, relative differences between the two implementations of Lumáwig can be attributed to rounding differences between Python and R.

#### 3.3. Complexity Analysis

## 4. Lumáwig in Digit Classification

## 5. Discussions and Conclusions

## 6. Repository for Lumáwig

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest


**Figure 1.** A Rips filtration over a point cloud captures the merging dynamics of clusters evolving from points across multiple scales. The dimension zero persistence diagram produced by this filtration is a set of points positioned along an extended vertical line at the merging heights in the corresponding dendrogram, except for the last point positioned at ∞ representing the eventual single component. The neighborhoods around points are colored by the persistent cluster determined by the elder rule.

**Figure 2.** Examples of point-matching between persistence diagrams highlighting the resulting bottleneck distance. Points in each diagram are shape coded and matched points are color coded. (**a**) illustrates when diagonal matching achieves the bottleneck distance. (**b**) is used in the proof of Lemma 1 and (**c**,**d**) in Lemma 2.

**Figure 4.** Boxplots of relative differences from Dionysus of the bottleneck computation outputs in (**a**) Hera, (**b**) Persim, (**c**) Lumáwig${}_{\mathsf{Py}}$, and (**d**) Lumáwig${}_{\mathsf{R}}$.

**Figure 5.** Running time (seconds in log scale) of Lumáwig versus the current state-of-the-art implementation in Hera. Five boxplots for the running time of the original algorithm in Dionysus are superimposed for reference.

**Figure 6.** (**a**,**b**) Boxplots of relative differences between the bottleneck computation outputs of the indicated pair of implementations. (**c**) Heat map of the median running times of Lumáwig${}_{\mathsf{R}}$. Each pixel represents the median running time (in seconds) for 100 computations of dimension zero bottleneck distance between diagrams. The number of points in the diagrams is in units of 1000.

**Figure 7.** Scatter plots with fitted curves of the median running times (in seconds) of Lumáwig${}_{\mathsf{R}}$ over 100 computations of dimension zero bottleneck distance between a base diagram with the labeled number of points and a diagram with $k$ thousand points, for $k=1,2,\dots ,100$.

**Figure 8.** Median running time in the computation of bottleneck distance between two diagrams with varying size and range settings, fitted with regression curves. Superimposed are the minimum and maximum running times over the 100-run simulation per unit of 1000 points to illustrate the running time range, and the narrow darker blue band to show the midspread.

**Figure 9.** Confusion matrix for the average prediction of the random forest over a 10-fold cross validation.

**Table 1.** Summary of significant decrease (at confidence level $\alpha =0.95$) in running time (in seconds) for paired tests versus Hera. Column labels are in thousands of points.

Points (thousands) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
---|---|---|---|---|---|---|---|---|---|---
Lumáwig${}_{\mathsf{R}}$ | 0.230 | 0.651 | 1.203 | 1.971 | 2.744 | 3.957 | 5.003 | 6.813 | 8.348 | 9.983
Lumáwig${}_{\mathsf{Py}}$ | 0.231 | 0.659 | 1.221 | 2.004 | 2.797 | 4.037 | 5.107 | 6.928 | 8.498 | 10.181

Points (thousands) | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20
---|---|---|---|---|---|---|---|---|---|---
Lumáwig${}_{\mathsf{R}}$ | 11.029 | 12.227 | 14.985 | 15.733 | 18.983 | 21.588 | 23.580 | 26.801 | 29.425 | 33.316
Lumáwig${}_{\mathsf{Py}}$ | 11.255 | 12.483 | 15.296 | 16.078 | 19.410 | 22.087 | 24.124 | 27.438 | 30.153 | 34.129

Points (thousands) | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30
---|---|---|---|---|---|---|---|---|---|---
Lumáwig${}_{\mathsf{R}}$ | 38.555 | 39.818 | 41.080 | 44.324 | 49.933 | 54.441 | 57.183 | 60.196 | 66.510 | 72.948
Lumáwig${}_{\mathsf{Py}}$ | 39.427 | 40.734 | 42.073 | 45.407 | 51.129 | 55.725 | 58.517 | 61.605 | 68.200 | 74.879

Digit | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Overall
---|---|---|---|---|---|---|---|---|---|---|---
Mean | 0.841 | 0.940 | 0.678 | 0.727 | 0.687 | 0.709 | 0.847 | 0.754 | 0.745 | 0.754 | 0.768
Std. Dev. | 0.030 | 0.011 | 0.033 | 0.019 | 0.032 | 0.037 | 0.030 | 0.020 | 0.015 | 0.032 | 0.011


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Ignacio, P.S.; Bulauan, J.-A.; Uminsky, D.
Lumáwig: An Efficient Algorithm for Dimension Zero Bottleneck Distance Computation in Topological Data Analysis. *Algorithms* **2020**, *13*, 291.
https://doi.org/10.3390/a13110291
