# Clustering of Monolingual Embedding Spaces

## Abstract

## 1. Introduction

## 2. Materials and Methods

#### 2.1. FastText Word Vectors

#### 2.1.1. Embedding Size

#### 2.1.2. Language Families

#### 2.2. Word Level Alignment of Cross-Lingual Word Embeddings

#### 2.2.1. Regression Method

#### 2.2.2. Orthogonal Methods

#### 2.3. Degree of Isomorphism

#### 2.3.1. Eigensimilarity

#### 2.3.2. Gromov–Hausdorff Distance

#### 2.3.3. Relational Similarity

#### 2.4. Clustering of Embedding Spaces

#### 2.4.1. Hierarchical Clustering

#### 2.4.2. Fuzzy C-Means Clustering

## 3. Results

#### 3.1. Hierarchical Clustering: Dendrogram

#### 3.1.1. Hierarchical Clustering: Eigensimilarity

**Impact of Embedding Size**

**Impact of Typological Similarities**

#### 3.1.2. Hierarchical Clustering: Gromov–Hausdorff Distance

**Impact of Embedding Size**

**Impact of Typological Similarity**

#### 3.1.3. Hierarchical Clustering: Relational Similarity

**Impact of Embedding Size**

**Impact of Typological Similarity**

#### 3.2. Fuzzy C-Means Clustering Algorithm

#### 3.2.1. FCM: Eigensimilarity

**Impact of Typological Similarity**

**Impact of Embedding Size**

#### 3.2.2. FCM: Gromov–Hausdorff Distance

**Impact of Embedding Size**

**Impact of Typological Similarity**

#### 3.2.3. FCM: Relational Similarity

**Impact of Embedding Size**

**Impact of Typological Similarity**

## 4. Discussion

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## References

Language | Cluster Label | Cluster 0 | Cluster 1 | Cluster 2 |
---|---|---|---|---|
Latin | 0 | 0.816125 | 0.066704 | 0.11717 |
Galician | 0 | 0.947083 | 0.015154 | 0.037764 |
Azerbaijani | 0 | 0.74006 | 0.061298 | 0.198643 |
Greek | 0 | 0.798175 | 0.052795 | 0.14903 |
Portuguese | 1 | 0.010361 | 0.951072 | 0.038567 |
Italian | 1 | 0.04371 | 0.742565 | 0.213725 |
Spanish | 1 | 0.00419 | 0.980757 | 0.015053 |
Russian | 1 | 0.058801 | 0.816661 | 0.124538 |
English | 1 | 0.006692 | 0.971533 | 0.021775 |
Belarusian | 2 | 0.075486 | 0.063486 | 0.861028 |
Slovak | 2 | 0.021567 | 0.022205 | 0.956228 |
Romanian | 2 | 0.014296 | 0.020679 | 0.965025 |
Turkish | 2 | 0.089349 | 0.330027 | 0.580625 |
Czech | 2 | 0.054219 | 0.417626 | 0.528155 |

Language | Cluster Label | Cluster 0 | Cluster 1 | Cluster 2 |
---|---|---|---|---|
Azerbaijani | 0 | 0.495611 | 0.049662 | 0.454728 |
Belarusian | 0 | 0.901068 | 0.012263 | 0.086669 |
Slovak | 0 | 0.852028 | 0.023885 | 0.124087 |
Romanian | 0 | 0.689999 | 0.070371 | 0.239631 |
Turkish | 0 | 0.661249 | 0.024543 | 0.314208 |
Czech | 0 | 0.80863 | 0.01881 | 0.172561 |
Russian | 0 | 0.547999 | 0.056037 | 0.395964 |
Galician | 1 | 0.051071 | 0.867585 | 0.081343 |
Italian | 1 | 0.02978 | 0.926727 | 0.043492 |
Latin | 2 | 0.341101 | 0.03313 | 0.625769 |
Greek | 2 | 0.129552 | 0.035976 | 0.834472 |
Portuguese | 2 | 0.160704 | 0.074351 | 0.764946 |
Spanish | 2 | 0.323441 | 0.032605 | 0.643954 |
English | 2 | 0.157306 | 0.090714 | 0.75198 |

Language | Cluster Label | Cluster 0 | Cluster 1 | Cluster 2 |
---|---|---|---|---|
Galician | 0 | 0.441000 | 0.266008 | 0.292991 |
Azerbaijani | 0 | 0.457129 | 0.224005 | 0.318866 |
Slovak | 0 | 0.496797 | 0.127349 | 0.375854 |
Romanian | 0 | 0.487956 | 0.239847 | 0.272196 |
Turkish | 0 | 0.547817 | 0.167559 | 0.284624 |
Czech | 0 | 0.515586 | 0.178321 | 0.306093 |
Portuguese | 1 | 0.158451 | 0.732514 | 0.109034 |
Italian | 1 | 0.155662 | 0.723955 | 0.120383 |
Spanish | 1 | 0.110976 | 0.810494 | 0.078530 |
English | 1 | 0.247609 | 0.535148 | 0.217243 |
Latin | 2 | 0.402406 | 0.152284 | 0.445310 |
Belarusian | 2 | 0.301698 | 0.154456 | 0.543846 |
Greek | 2 | 0.341317 | 0.102445 | 0.556238 |
Russian | 2 | 0.298074 | 0.207581 | 0.494345 |
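
To make the relationship between the columns concrete, the following minimal sketch assumes that the three numeric columns are fuzzy C-means membership degrees and that the Cluster Label column is the cluster with the largest membership. Both are inferences from the data rather than statements from the text: each row sums to approximately 1 and the label matches the largest entry. The helper names `hard_label` and `membership_margin` are hypothetical, and the three rows are copied from the last table above.

```python
# Membership degrees copied from three rows of the last table above.
# Assumption: columns are FCM membership degrees for clusters 0, 1, 2.
memberships = {
    "Spanish": [0.110976, 0.810494, 0.078530],
    "English": [0.247609, 0.535148, 0.217243],
    "Latin": [0.402406, 0.152284, 0.445310],
}


def hard_label(row):
    """Return the index of the cluster with the largest membership degree."""
    return max(range(len(row)), key=row.__getitem__)


def membership_margin(row):
    """Gap between the two largest memberships; a small gap means an ambiguous assignment."""
    top, second = sorted(row, reverse=True)[:2]
    return top - second


labels = {lang: hard_label(row) for lang, row in memberships.items()}
margins = {lang: membership_margin(row) for lang, row in memberships.items()}

for lang in memberships:
    print(f"{lang}: label={labels[lang]}, margin={margins[lang]:.6f}")
```

Under these assumptions, the margin gives a quick way to read the tables: Spanish is assigned to its cluster with a wide gap, while Latin sits almost evenly between two clusters, so its hard label carries much less weight.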

