# MASS-UMAP: Fast and Accurate Analog Ensemble Search in Weather Radar Archives


## Abstract

## 1. Introduction

## 2. Materials and Methods

#### 2.1. UMAP: Uniform Manifold Approximation and Projection

#### 2.2. MASS: Mueen’s Algorithm for Similarity Search

#### 2.3. Meteotrentino Radar Dataset

#### 2.4. MASS-UMAP Workflow

#### 2.5. Evaluation Framework

#### 2.6. Evaluation Part I: Dimensionality Reduction Training and Verification

#### 2.6.1. Stability of Ranked Lists

#### 2.6.2. Jaccard Distance

#### 2.7. Evaluation Part II: Sequence Search Evaluation

## 3. Results

#### 3.1. Exploration of UMAP Embeddings

#### 3.2. Evaluation Part I: Dimensionality Reduction

- limits: $\left|K\right|=8$ with configurations $K=[5,10,15,20,50,100,200,500]$
- components: $\left|D\right|=6$ with configurations $D=[2,5,10,15,20,100]$
- neighbors: $\left|N\right|=6$ with configurations $N=[5,10,50,100,200,1000]$
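As a rough illustration of how large this search is, the sketch below enumerates the three lists as a grid. It assumes every combination was evaluated (the list above does not state this explicitly) and that PCA, which has no neighbors parameter, is evaluated over limits and components only. The variable names are illustrative.

```python
from itertools import product

# Values copied from the list above.
K = [5, 10, 15, 20, 50, 100, 200, 500]   # limits (top-k cut-offs)
D = [2, 5, 10, 15, 20, 100]              # components
N = [5, 10, 50, 100, 200, 1000]          # neighbors (UMAP only)

# Assumption: one UMAP model per (d, n) pair, each scored at every limit k.
umap_models = list(product(D, N))
umap_cells = list(product(K, D, N))

# Assumption: PCA is scored on the (k, d) grid, since it has no neighbors parameter.
pca_cells = list(product(K, D))

print(len(umap_models), len(umap_cells), len(pca_cells))
```

Under these assumptions the grid contains 36 UMAP models, 288 UMAP evaluation cells, and 48 PCA evaluation cells.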

#### 3.3. Evaluation Part II: Spatiotemporal Analog Search Performance

#### 3.3.1. Analog Quality

#### 3.3.2. Execution Times and Memory Requirements

## 4. Discussion

## 5. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Abbreviations

Abbreviation | Definition |
---|---|
MSE | Mean Squared Error |
PCA | Principal Component Analysis |
UMAP | Uniform Manifold Approximation and Projection |
MASS | Mueen’s Algorithm for Similarity Search |
AnEn | Analog Ensemble |

## Appendix A

#### Appendix A.1

#### Appendix A.2

**Figure A8.** Example of UMAP embeddings showing the effect of different values of the neighbors parameter ($n$) in two dimensions ($d=2$) on the training set, colored by wet area ratio.

#### Appendix A.3. Effect of Different Query Lengths on Analog Retrieval

**Figure A15.** Example of a query result for $t=6$ frames when using as input (red box) a single radar scan (**a**) or the whole sequence (**b**). The matching sequences are marked in green, while the time extensions are highlighted in orange.

**Figure 1.** Data preprocessing pipeline. The whole dataset is first filtered to remove data chunks that do not contain an interesting amount of signal. A bilinear interpolation filter is then applied to the images to reduce the resolution from 480 × 480 to 64 × 64 pixels. The transformed dataset is finally split into search and verification sets.
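The downsampling step in Figure 1 can be sketched with plain NumPy. This is a minimal illustration, not the pipeline's actual code: the function name `bilinear_resize` and the half-pixel-centre sampling convention are assumptions, and imaging libraries may additionally apply anti-aliasing when shrinking by a factor of 7.5.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Resample a 2-D array to (out_h, out_w) with bilinear interpolation."""
    in_h, in_w = img.shape
    # Source coordinates of each output pixel centre (half-pixel convention, assumed).
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :]   # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

# Synthetic stand-in for one 480 x 480 radar scan.
scan = np.random.default_rng(0).random((480, 480))
small = bilinear_resize(scan, 64, 64)
print(small.shape)
```

Each output pixel is interpolated from its four nearest source pixels, so fine-scale structure between sampled positions is not averaged in.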

**Figure 3.** Workflow of the model development for the UMAP training and verification. The same workflow is used for training and verification of the principal component analysis (PCA), which is used as a comparison method.

**Figure 4.** UMAP embedding visualization of the second and third components for the search space (**a**) and the verification space (**b**). The embeddings are colored by wet area ratio (WAR).

**Figure 5.** Canberra stability indicator results for PCA with different values of limit $k$ and components $d$ (darker/lower is better). Lower values indicate that the configuration better preserves the rankings found by computing the MSE on the original images. The mean, standard deviation, and suboptimal scenario, given by the sum of mean and standard deviation, are reported.
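To give an intuition for what the indicator in Figure 5 measures, the sketch below computes a plain Canberra distance between two top-$k$ ranked lists. It is a simplification under stated assumptions: items missing from one list are assigned rank $k+1$, and no normalization is applied, so the values are not directly comparable with those in the figure. The function name and the example rankings are hypothetical.

```python
def canberra_topk(reference, candidate, k):
    """Unnormalized Canberra distance between two top-k ranked lists.

    Assumption: an item absent from a list is given rank k + 1.
    """
    ref, cand = reference[:k], candidate[:k]
    rank_ref = {item: r for r, item in enumerate(ref, start=1)}
    rank_cand = {item: r for r, item in enumerate(cand, start=1)}
    total = 0.0
    for item in set(ref) | set(cand):
        a = rank_ref.get(item, k + 1)
        b = rank_cand.get(item, k + 1)
        total += abs(a - b) / (a + b)
    return total

# Hypothetical image ids ranked by MSE (reference) and by distance in the embedding.
mse_ranking = [7, 3, 9, 1, 4]
embedding_ranking = [3, 7, 2, 9, 8]
print(canberra_topk(mse_ranking, embedding_ranking, k=3))
```

Identical rankings give 0, and disagreements near the top of the list weigh more than disagreements further down, which is why the measure suits ranked retrieval results.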

**Figure 6.** Jaccard distance values for PCA with different values of limit $k$ and components $d$ (darker/lower is better). The number in parentheses is the cardinality of the intersection between the top-$k$ PCA list and the top-$k$ MSE list. The mean, standard deviation, and “suboptimal scenario”, given by the sum of mean and standard deviation, are reported.
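The two quantities reported in Figure 6 can be illustrated with a short sketch that treats each top-$k$ list as a set. The function name and the example rankings are hypothetical; the formula is the standard Jaccard distance, one minus the ratio of intersection size to union size.

```python
def jaccard_topk(reference, candidate, k):
    """Return (Jaccard distance, intersection size) for two top-k lists."""
    ref, cand = set(reference[:k]), set(candidate[:k])
    shared = len(ref & cand)
    union = len(ref | cand)
    return 1.0 - shared / union, shared

# Hypothetical image ids ranked by MSE (reference) and by distance in PCA space.
mse_ranking = [7, 3, 9, 1, 4]
pca_ranking = [3, 7, 2, 9, 8]

distance, shared = jaccard_topk(mse_ranking, pca_ranking, k=3)
print(distance, shared)
```

Unlike a rank-based measure, this ignores the order inside the top-$k$ lists and only asks whether the same items were retrieved.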

**Figure 12.** UMAP Jaccard score for the chosen number of neighbors, $n=200$, vs. PCA. Only $d=2$ and $d=5$ are drawn for UMAP, as the values overlap for $d$ from 5 to 100. In panel (**b**), the shaded area represents the standard deviation.

**Figure 13.** Mean MSE values for analog sequences of $t=3$ obtained with PCA ($d=5$ and $d=20$ components), UMAP ($d=5$ components), and MSE search in the original space. Dotted lines represent the standard deviation of the MSE.

**Figure 14.** Mean MSE values for analog sequences of $t=6$ obtained with PCA ($d=5$ and $d=20$ components), UMAP ($d=5$ components), and MSE search in the original space. Dotted lines represent the standard deviation of the MSE.

**Figure 15.** Mean MSE values for analog sequences of $t=12$ obtained with PCA ($d=5$ and $d=20$ components), UMAP ($d=5$ components), and MSE search in the original space. Dotted lines represent the standard deviation of the MSE.

**Figure 16.** Mean MSE values for analog sequences of $t=24$ obtained with PCA ($d=5$ and $d=20$ components), UMAP ($d=5$ components), and MSE search in the original space. Dotted lines represent the standard deviation of the MSE.

**Figure 18.** Top-2 most similar sequences found in the training set for the query sequence shown in Figure 17, using MSE comparison on the original radar scans.

**Figure 19.** As in Figure 18, but searching PCA embeddings ($d=5$) with MASS. The PCA embeddings fail to provide any correspondence with the reference sequences found by MSE.
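For readers unfamiliar with the search step used in Figures 18 and 19, the sketch below shows the core idea of MASS on a single one-dimensional series: the z-normalized Euclidean distance between a query and every window of the archive, with the sliding dot products computed through the FFT. It is an illustrative sketch only. How the distance profile is applied across the $d$ embedding components is not specified in this section, and the function name and synthetic data are assumptions.

```python
import numpy as np

def mass_distance_profile(query, series):
    """z-normalized Euclidean distance from `query` to every window of `series`."""
    q = np.asarray(query, dtype=float)
    ts = np.asarray(series, dtype=float)
    m, n = len(q), len(ts)

    # Sliding dot products via FFT: convolve the series with the reversed query.
    q_padded = np.zeros(n)
    q_padded[:m] = q[::-1]
    dots = np.fft.irfft(np.fft.rfft(ts) * np.fft.rfft(q_padded), n)[m - 1:]

    # Rolling mean and standard deviation of every length-m window.
    csum = np.cumsum(np.insert(ts, 0, 0.0))
    csum2 = np.cumsum(np.insert(ts ** 2, 0, 0.0))
    mean = (csum[m:] - csum[:-m]) / m
    var = (csum2[m:] - csum2[:-m]) / m - mean ** 2
    std = np.sqrt(np.maximum(var, 0.0))

    # Correlation of each window with the query, then d = sqrt(2m(1 - corr)).
    corr = (dots - m * q.mean() * mean) / (m * q.std() * std)
    return np.sqrt(np.maximum(2 * m * (1 - corr), 0.0))

# Synthetic archive; the query is cut from position 50, so that window matches exactly.
series = np.random.default_rng(0).normal(size=200)
query = series[50:62].copy()
profile = mass_distance_profile(query, series)
best = int(np.argmin(profile))
print(best, profile[best])
```

Because of the z-normalization, windows that differ from the query only by an offset or a positive scale factor also receive a distance close to zero. Windows with zero variance would need special handling, which this sketch omits.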

Symbol | Description |
---|---|
$mindist$ | UMAP training parameter used to define a minimum distance between elements in the low-dimensional representation. In our study this value is fixed to $0.1$. |
$metric$ | UMAP training parameter used to compare images in the original space. In this study we use the Euclidean distance (the Euclidean distance is rank invariant with respect to the MSE). |
$n$ | UMAP training parameter used to define the number of nearest neighbors used to build the local distance function. $N$ is the set of all tested values of $n$. |
$d$ | Number of components (dimensions) used by the dimensionality reduction (UMAP/PCA). $D$ is the set of all tested values of $d$. |
$t$ | Length of the query sequence (number of consecutive radar images) to match. $T$ is the set of all tested values of $t$. |
$k$ | Number of closest analogues to consider for further processing. $K$ is the set of all tested values of $k$. |
${l}_{s}$ | Number of radar images in the search set (archive). The search set contains all the valid data from 2010 to 2016. |
${l}_{v}$ | Number of radar images in the verification set (query data). The verification set contains all the valid data from 2017 to 2019. |
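The rank invariance noted for $metric$ follows from the fact that, for images with $P$ pixels, the Euclidean distance equals $\sqrt{P\cdot\mathrm{MSE}}$, which is a monotonically increasing function of the MSE. The sketch below checks this on a toy example; the four-pixel "images" and the helper names are purely illustrative.

```python
import math

def mse(a, b):
    """Mean squared error between two equally sized pixel vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def euclidean(a, b):
    """Euclidean distance between two equally sized pixel vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy data standing in for flattened radar images.
query = [0.0, 1.0, 2.0, 3.0]
archive = [
    [0.0, 1.0, 2.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 1.0, 1.0, 1.0],
]

rank_by_mse = sorted(range(len(archive)), key=lambda i: mse(query, archive[i]))
rank_by_euclidean = sorted(range(len(archive)), key=lambda i: euclidean(query, archive[i]))
print(rank_by_mse, rank_by_euclidean)
```

Both orderings are identical, so training UMAP with the Euclidean metric preserves the neighbor rankings that an MSE comparison would produce in the original space.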

Sequence Length | 3 | 6 | 12 | 24 |
---|---|---|---|---|
(1) UMAP Transform | 194 ms ± 6.72 ms | 303 ms ± 8.87 ms | 451 ms ± 11.3 ms | 745 ms ± 15.5 ms |
(2) MASS search | 1.01 s ± 9.11 ms | 1.05 s ± 13.4 ms | 1.12 s ± 23.1 ms | 1.31 s ± 25 ms |
(3) top-k MSE reorder | 11.1 ms ± 0.12 ms | 43.6 ms ± 0.72 ms | 86.4 ms ± 1.27 ms | 172 ms ± 1.11 ms |
MASS-UMAP (1 + 2 + 3) | 1.22 s ± 15.6 ms | 1.37 s ± 23.0 ms | 1.66 s ± 35.7 ms | 2.23 s ± 35.67 ms |
MASS-UMAP end-to-end | 1.18 s ± 22.5 ms | 1.37 s ± 48.4 ms | 1.65 s ± 82.9 ms | 2.3 s ± 11.9 ms |
linear MSE search | 9.59 s ± 1.08 s | 20.4 s ± 1.6 s | 39.5 s ± 3.74 s | 1 min 24 s ± 1.02 s |
MASS-UMAP speedup | 8.1× | 14.9× | 23.9× | 36.5× |
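The speedup row appears to be the ratio between the linear MSE search time and the end-to-end MASS-UMAP time. The short check below, using the mean timings copied from the table and ignoring the reported spreads, reproduces the published factors after rounding to one decimal place.

```python
# Mean timings in seconds, keyed by sequence length t (copied from the table).
mass_umap_end_to_end = {3: 1.18, 6: 1.37, 12: 1.65, 24: 2.3}
linear_mse_search = {3: 9.59, 6: 20.4, 12: 39.5, 24: 84.0}  # 1 min 24 s = 84 s

speedup = {
    t: round(linear_mse_search[t] / mass_umap_end_to_end[t], 1)
    for t in mass_umap_end_to_end
}
print(speedup)
```

The linear search time grows roughly in proportion to the sequence length, while the MASS-UMAP time grows much more slowly, which is why the speedup increases with $t$.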

