Origins of East Caucasus Gene Pool: Contributions of Autochthonous Bronze Age Populations and Migrations from West Asia Estimated from Y-Chromosome Data

The gene pool of the East Caucasus, encompassing modern-day Azerbaijan and Dagestan populations, was studied alongside adjacent populations using 83 Y-chromosome SNP markers. The analysis of genetic distances among 18 populations (N = 2216) representing Nakh-Dagestani, Altaic, and Indo-European language families revealed the presence of three components (Steppe, Iranian, and Dagestani) that emerged in different historical periods. The Steppe component occurs only in Karanogais, indicating a recent medieval migration of Turkic-speaking nomads from the Eurasian steppe. The Iranian component is observed in Azerbaijanis, Dagestani Tabasarans, and all Iranian-speaking peoples of the Caucasus. The Dagestani component predominates in Dagestani-speaking populations, except for Tabasarans, and in Turkic-speaking Kumyks. Each component is associated with distinct Y-chromosome haplogroup complexes: the Steppe includes C-M217, N-LLY22g, R1b-M73, and R1a-M198; the Iranian includes J2-M172(×M67, M12) and R1b-M269; the Dagestani includes J1-Y3495 lineages. We propose that the most common lineage of haplogroup J1-Y3495 originated in an autochthonous ancestral population in central Dagestan and split ~6 kya into J1-ZS3114 (Dargins, Laks, Lezgi-speaking populations) and J1-CTS1460 (Avar-Andi-Tsez linguistic group). Based on the archeological finds and DNA data, the analysis of J1-Y3495 phylogeography suggests that population growth in the territory of modern-day Dagestan started in the Bronze Age and was followed by dispersal and microevolution of the diverged populations.


Introduction
The East Caucasus, which spans modern-day Dagestan and Azerbaijan, is an important land bridge connecting Europe to West Asia. One of the gateways through the Caucasus is the Big Caucasian Pass, a strip of coastal land that runs between the East Caucasus mountains and the Caspian Sea all the way from Derbent to Sumgait.
The early humans of the Oldowan culture dispersed to the East Caucasus about 2 million years ago [1]. The ancient Chokh settlement site in today's Dagestan is the key monument of the Mesolithic, Neolithic, and Bronze Age cultures that used to exist in the mountainous areas of the Northeast Caucasus. Evidence from recent radiocarbon dating suggests that domestication of animals, crop cultivation, and pottery making in this region began no later than the late 7th or early 6th millennium BC [2]. Agriculture was booming; according to Nikolai Vavilov, the mountainous parts of Dagestan were the center of terraced farming [3]. The economic development during the Neolithic period (crop farming and animal husbandry, the advent of pottery, weaving, stone grinding, and polishing techniques) led to rapid population growth and colonization of new territories. In Azerbaijan, advanced Neolithic societies emerged at the dawn of the 6th millennium BC as a succession from the Neolithic societies of Southwest Asia [4,5]. The discovery of metals in the Eneolithic period (~7 kya) spurred the development of technology. The Bronze Age (6-4 kya) was marked by the emergence of new metalwork techniques, economic and social changes (social inequality, tribal confederations), and trade relations with the populations of Southeastern Europe and West Asia. In the Middle Bronze Age, nomads from the Eurasian steppe arrived in the East Caucasus, introducing the kurgan burial custom into the region.
In the 7th-6th centuries BC, the south of the region was infiltrated via the Derbent Pass by the Iranian-speaking Scythians who formed a military-political alliance known as Ashguza on the lands of today's West Azerbaijan and Northwest Iran. Later, this territory was annexed by the Median Kingdom. In the late 1st millennium BC, the state of Caucasian Albania emerged from a tribal confederation in the northern part of the region. It had its own writing system, and its population adopted Christianity in the 4th century AD. Caucasian Albania was largely made up of autochthonous Caucasian tribes that spoke the Lezgic languages of the Nakh-Dagestanian family, although its rule extended over some Iranian tribes too. The territory of modern-day Dagestan was a cradle to such early states as Lakz, Tabarseran, Haidak, Sarir, Gumik, etc., which came into being in the first half of the 1st millennium AD.
During the first few centuries AD, other Iranian tribes, including Sarmatians, Maskut, and Alans, were arriving in the Caucasus. The South Caucasus was raided by Turkic-speaking Huns, Sabirs, and Khazars who came from the North via the Derbent Pass in the early Middle Ages (between the late 4th and 8th centuries) [6]. In the 7th century AD, the north of lowland Dagestan was occupied by the rising Khazar Khanate, whereas the south of the region was colonized by the Sasanian Empire in the 4th century. In the 7th-9th centuries, the East Caucasus was controlled by the Arab Caliphate, which was pursuing an aggressive migration policy that resulted in the Islamization of the local population. Incursions of Turkic-speaking Seljuks, the founders of the powerful empire spanning Central and West Asia and the South Caucasus, began in the 11th century as the invaders continued their expansion from Central Asia into the East Caucasus, including Derbent. Turkic peoples from Cumania (Desht-i-Kipchak) migrated to the South Caucasus in the 11th-12th centuries; their descendants may have contributed to the emergence of the subethnic Azerbaijani-Karapapakh group [7]. Other Turkic-speaking peoples of the East Caucasus are Kumyks and Nogais. There is no consensus on the origin of Kumyks. The prevailing hypothesis traces their descent to the local autochthonous population that had close ethnocultural contacts with foreign Turkic tribes like Sabir, Khazar, or Kipchak. In turn, Nogais were nomads from a late migration wave; most of them are settled in Dagestan and are represented by the Karanogai population.
After the Tatar-Mongol expansion that started in the 14th century, Islamic influence regained its primacy in Dagestan. The local medieval principalities of that time existed until the 19th century when they were absorbed by the Russian Empire. The southern regions of the East Caucasus were under Iranian influence during the 15th-19th centuries; they were incorporated by the Russian Empire in the aftermath of the Russo-Persian wars when the peace treaties were signed in 1813 and 1828. Later, these territories became the state of Azerbaijan.
Thus, the gene pools of the autochthonous East Caucasus populations were influenced by the massive ancient migration waves of Iranian speakers and later migrations of Eurasian Turkic peoples. The rich history of the region is reflected in the linguistic diversity of the East Caucasus. The languages of the region include the Nakh-Dagestanian languages (spoken by over 30 ethnic and ethnographic groups), the Turkic languages of the Altaic language family (spoken by Azerbaijanis, Kumyks, and Karanogais), and the Iranian languages of the Indo-European language family (spoken by Tats, Mountain Jews, Yezidis, Kurds, and the Talysh) [8]. This stunning diversity of Eastern Caucasian populations is a major obstacle in the study of the East Caucasus gene pool, even if highly effective genogeographic tools are used, such as genotyping for Y-chromosome markers.
Samples of Azerbaijani populations were studied for Y-chromosome markers in [9][10][11][12][16]. Some population studies point to the similarity between the gene pools of Azerbaijanis and their neighbors in the Caucasus and West Asia [11]. Azerbaijanis have genetic affinity to the Iranian/Turkic-speaking populations of West Asia but differ from Turkmens [16]. According to [20,21], Iranian Azerbaijanis bear genetic similarities to their Iranian-speaking neighbors, while the Azerbaijani-Terekeme of Dagestan have a genetic affinity to Kumyks [23]. Two more publications [25,26] report the results of commercial DNA testing among the Azerbaijanis. Briefly, the specimens were obtained from volunteers who were willing to learn about their Y-chromosome lineage and could afford the test. The results of the test revealed the prevalence of West Asian lineages in the Azerbaijani gene pool; however, the accumulated published data suggests that 18% of their gene pool reflects medieval migrations from Central Asia and 6% is associated with migrations from East Asia [25].
The Azerbaijani gene pool remains understudied because earlier studies used small genotyping panels or small samples [9][10][11][12], or studied populations from neighboring regions (Dagestan or Iran) [16,20,21,23]. It might be difficult to use the data generated by such research at the current level of Y-haplogroup resolution and to comprehensively describe the Azerbaijani gene pool. The data obtained by our colleagues is intended for the analysis of broader subjects than the genetic composition of the East Caucasus. However, the absence of a substantial Central Asian component and the genetic similarity of Azerbaijanis to the populations of the Middle East and Caucasus give some general idea about their gene pool.
Dagestani populations have been studied for Y-chromosome markers much more thoroughly [13,14,17,18,22,24,28] than the populations of Azerbaijan. For example, a significant correlation was found between the frequencies of Dagestani haplogroups, their geography and the language (as demonstrated by the lexicostatistical data) [17]. A similar finding was reported by Karafet T. [22], who pointed to the high interpopulation diversity of Dagestan's indigenous peoples and confirmed the hypothesis about the descent of all Nakh-Dagestanian speaking populations from a single protopopulation ~6000-6500 years ago. The descent of Dagestani-speaking peoples from a single postglacial ancestral population and the subsequent genetic divergence due to landscape changes, genetic drift, and the founder effect were described in [14]. Unlike the mountain people of Dagestan, its lowland populations bear a greater Y-chromosome diversity, which can be explained by their contacts with Central Asian populations [13]. The Tsez, who represent the Avar-Andi-Dido language family, demonstrate a modest genetic diversity due to the prevalence of the J1 haplogroup and the strong founder effect [24]. Dagestan is one of the most ethnically diverse Russian regions; its gene pool was systematically analyzed in the past using the genotyping panels of Y-chromosome markers available at that time [17,18,22], and the results of the analysis are now outdated. It is known that the J1 haplogroup occurs in the populations of Dagestan at high frequency, so it is important to explore its structure and the distribution of its branches in order to further analyze the history of Dagestan's populations.
Thus, the aim of this study was to systematically analyze the gene pools of populations inhabiting the historically connected territories of Dagestan and Azerbaijan and to reconstruct the genetic history of the East Caucasus using a broad panel of Y-chromosome markers. The results of this study could be used to model the dispersal of early humans across the Caucasus, West Asia and Eurasia, using the updated and expanded DNA data.

Materials and Methods
Samples. We have analyzed 18 populations from the East Caucasus and the adjacent regions (N = 2216; Table 1) that represent 3 language families: Nakh-Dagestanian (the Avar-Andi-Dido, Dargin, Lak, Lezgi, and Nakh branches), Altaic (the Turkic branch) and Indo-European (the Iranian branch). Since the focus of the study was on the East Caucasus, the gene pools of the indigenous peoples of Dagestan and Azerbaijan (N = 1329) were analyzed in greater detail than North Caspian steppe populations, Nakh and Iranian speakers of the Caucasus, which constituted the comparison group and were represented by pooled samples without further division into smaller subgroups. The Nakh-Dagestanian language family was represented by 14 ethnic groups: Avars and Tindi (the Avar-Andic language branch), Dido and Hinukh (the Tsez branch), Dargins, Kaitags and Kubachi (the Dargin branch), Laks (the Lak branch), Lezgins, Rutuls, Tabasarans and Tsakhur (the Lezgi branch), and Chechens and Ingush (the Nakh branch). Turkic speakers were represented by the populations of the East Caucasus (Karanogais, Kumyks, the Azerbaijani Terekeme of Dagestan, and the Azerbaijanis of Azerbaijan) and the West of the Eurasian steppe (Stavropol Nogais, Trukhmens, Astrakhan Tatars and Nogais). Iranian-speaking populations of the Caucasus were represented by Tats, Mountain Jews, Yezidis, Kurds, and Talysh. More than half of the studied samples are published here for the first time. The samples that had been studied previously ([17,23], Supplementary Table S1) were additionally genotyped using an expanded SNP panel. The biological specimens were collected during the 1998-2018 expeditions under the supervision of Prof. Elena Balanovska, Prof. Oleg Balanovsky, and Prof. Elvira Pocheshkhova. The Azerbaijani population was studied in 2015-2022; that research initiative was supported by the Azerbaijani expat community. All samples were collected from unrelated male donors whose ancestors from at least 3 previous generations belonged to the studied ethnicity and population. Informed consent was obtained from all donors. The study was approved by the Ethics Committee of the Research Centre for Medical Genetics.
Molecular Genetic Testing. In the preparatory step, DNA was isolated from the samples of venous blood and saliva using a QIAsymphony SP instrument or by phenol-chloroform extraction. DNA concentrations were measured with a Nanodrop 2000 spectrophotometer and a Qubit 4.0 fluorometer. The samples were aliquoted and distributed into well plates for further genotyping using a QIAgility workstation.
The samples were genotyped using OpenArray technology, a QuantStudio 12 Flex Real-Time PCR system, TaqMan probes, and a custom panel of the most informative SNP markers. The set of 83 studied SNPs included two subsets: the "Core" array of 62 SNPs and a J1-M267(×P58) array of 31 SNPs (Supplementary Table S2). The "Core" array covered the main haplogroups of the world's Y-chromosome tree, specific to the indigenous populations of Northern Eurasia, which spans Russia, post-Soviet states and Mongolia. Since the J1-M267(×P58) haplogroup has a pronounced presence in the gene pool of the Eastern Caucasus [17,18], the J1-M267(×P58) array was used to study its phylogeography. The dates of the haplogroups' origin provided in this article were borrowed from YFull [https://www.yfull.com/, accessed on 31 May 2023] if not specified otherwise.
Statistical analysis. The pairwise matrix of Nei's genetic distances [29] (Table 2) was computed in the original DJ software [30] from the frequencies of 38 Y-chromosome haplogroups that are polymorphic in the studied populations (Supplementary Table S3). The visual plot for the matrix was created in Statistica 7.0 (StatSoft©, Tulsa, Oklahoma, United States) using the multidimensional scaling method.
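As an illustration of the distance computation described above, the following minimal sketch computes a pairwise matrix from haplogroup frequencies. It assumes the standard form of Nei's distance for a single locus, D = -ln(J_xy / sqrt(J_x · J_y)); the DJ software may differ in details such as bias correction. The function names and the frequency values are hypothetical and are not taken from the study data.

```python
import numpy as np

def nei_distance(p, q):
    """Nei's standard genetic distance between two populations,
    given their haplogroup frequency vectors (single locus)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    j_xy = np.sum(p * q)   # probability of identity between populations
    j_x = np.sum(p * p)    # probability of identity within population X
    j_y = np.sum(q * q)    # probability of identity within population Y
    identity = j_xy / np.sqrt(j_x * j_y)
    # Note: populations sharing no haplogroups give identity 0 and D = inf.
    return -np.log(identity)

def distance_matrix(freqs):
    """Symmetric matrix of pairwise Nei's distances (rows = populations)."""
    n = len(freqs)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = nei_distance(freqs[i], freqs[j])
    return d

# Hypothetical frequencies of four haplogroups in three populations
# (each row sums to 1); illustrative values only.
freqs = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.60, 0.20, 0.10, 0.10],
    [0.10, 0.10, 0.40, 0.40],
])
D = distance_matrix(freqs)
print(np.round(D, 3))
```

In this toy input the first two populations share a dominant haplogroup and therefore end up much closer to each other than either is to the third.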
Cartographic analysis. Genogeographic maps (frequency distribution maps, maps of genetic distances) were constructed from the genotyping data and the information from the Y-base database (Tables 1 and S1), which had been developed under the supervision of Prof. Oleg Balanovsky. The original GeneGeo software used to create the genogeographic maps had been developed under the supervision of Prof. Elena Balanovska and Prof. Oleg Balanovsky. Frequency distribution maps for the studied Y-chromosome haplogroups were created using the average weighted interpolation procedure with a search radius of 400 km and a weight function inversely proportional to the cube of the distance [31]. The maps of genetic distances were constructed for 39 Y-chromosome haplogroups of the studied East Caucasus populations using an algorithm similar to the one described in [32][33][34].
Note. The fill color of cells in the 1st column and the 1st row represents the language group/branch spoken by the studied populations: Avaro-Andic group, green; Tsezic (Didoic), yellow; Dargic branch, bright blue; Lak branch, light blue; Lezgic branch, violet; Nakh branch, grey; Turkic branch, orange; Iranian group, red. The fill color of the cells with figures represents genetic distances: white and light colors correspond to the minimal differences between populations; the richer the brown color, the greater the genetic distance.
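The interpolation procedure described above (weighted averaging within a 400 km search radius, with weights inversely proportional to the cube of the distance) can be sketched as follows. This is only an illustration under stated assumptions: the use of the haversine great-circle distance, the function names, and the sample coordinates and frequencies are hypothetical, and the GeneGeo implementation may differ.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (assumed metric, not confirmed by the source)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def interpolate_frequency(lat, lon, samples, radius_km=400.0, power=3):
    """Weighted average of sampled frequencies at a grid node.
    Weights are 1 / distance**power; samples beyond radius_km are ignored.
    Returns None if no sample lies within the search radius."""
    num = 0.0
    den = 0.0
    for s_lat, s_lon, freq in samples:
        d = haversine_km(lat, lon, s_lat, s_lon)
        if d > radius_km:
            continue
        if d < 1e-9:          # grid node coincides with a sampled population
            return freq
        w = 1.0 / d ** power
        num += w * freq
        den += w
    return num / den if den > 0 else None

# Hypothetical sampled populations: (latitude, longitude, haplogroup frequency)
samples = [
    (42.98, 47.50, 0.60),
    (40.41, 49.87, 0.10),
    (55.75, 37.62, 0.02),   # too far from the node below to contribute
]
value = interpolate_frequency(42.0, 48.0, samples)
print(value)
```

Because of the cubic weight, the nearest sampled population dominates the estimate, which keeps local frequency peaks sharp on the resulting map.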

The Spectrum of Y-Chromosome Haplogroups in the Populations of the East Caucasus and the Adjacent Regions
While studying the genotyping data of 18 indigenous populations of the East Caucasus and the adjacent regions (N = 2216), we factored their language and area into the analysis. The geographic variability of the analyzed Y-chromosome haplogroups across the East Caucasus (Figure 1, Supplementary Table S3) is not the same for the representatives of different language families: it is more pronounced for Nakh-Dagestanian linguistic groups than for Turkic and Iranian-speaking populations. Overall, we have identified 5 major haplogroups with different geographic distribution patterns: J1, J2, R1a, R1b, and N. Their diversity is demonstrated on the frequency distribution maps (Figure 2). The distribution of haplogroup J1 follows a distinct focal pattern: in Dagestan, its frequency declines in all directions from the center to the periphery (Figure 2D). Haplogroup J2-M172(×M67, M12) also has a focal pattern, but its peak frequencies occur in the south of the region: the cumulative contribution and diversity of J2 branches is greater for the Azerbaijanis and the Iranian-speaking populations of the Caucasus (Figure 2E-G). R1b-M269 echoes this pattern, occurring at high frequencies in the south of the region among the Azerbaijani Talysh (48% [35]; Figure 2E-H). Overall, haplogroups R1a and R1b expose the connection between the steppe populations in the north and the groups settled along the Caspian coastal line in the south.
Legend: 1, Karanogais; 2, Chechen; 3, Kumyks; 4, Tindi; 5, Avars; 6, Dargins; 7, Tsez (Dido) and Hinukh; 8, Laks; 9, Kaitak; 10, Azerbaijanis from Dagestan; 11, Kubachi; 12, Tabasarans; 13, Tsakhur; 14, Rutuls; 15, Lezgins; 16, Karapapakhs; 17, Azerbaijanis (including Karapapakhs) from Azerbaijan; 18, Azerbaijanis (excluding Karapapakhs) from Azerbaijan; 19, Iranian-speaking peoples of the Caucasus (Tats, Talysh, Kurds); 20, Iranian-speaking peoples of the Caucasus (Yezidis).
The high frequency of haplogroup J1-M267(×P58) observed for Kumyks may indicate a significant genetic contribution of Nakh-Dagestanian-speaking populations since this haplogroup does not typically occur in other Turkic and Iranian populations of the East Caucasus (Figure 1).
Apart from the genetic drift, the presence of other J1 branches in addition to J1-P58 in the region may be responsible for the high frequency of J1-M267(×P58).Since the early carriers of haplogroup J1 emerged in the Caucasus as early as the Paleolithic (~13 kya; hunter-gatherers in the territory of modern-day Georgia [36,40]) and the rise of farming communities in the Neolithic drove population growth and the emergence of new haplogroups, a few local lineages may have arisen within J1-M267(×P58).This hypothesis will be explored below.
Haplogroup J2 is rarely found in the populations of Dagestan but its J2-M67(×M92) branch occurs at 67% frequency among their neighbors, the Nakh-speaking Chechens and Ingush (Figures 1 and 2F, Supplementary Table S3).J2 is represented at 20% frequency by its clade J2-M172(×M67, M12) in the gene pools of the Azerbaijanis and Iranian-speaking peoples of the Caucasus.
Thus, the frequencies of haplogroup J2 and its branch J2-M172(×M67, M12) among the Azerbaijani-Terekeme of Dagestan and the Azerbaijanis of Azerbaijan are close to the frequencies observed not only for the Iranian-speaking peoples of the Caucasus but also for the populations of West Asia.
Haplogroup N reaches its highest frequencies in the steppe population of Karanogais in the North of Dagestan (22%) and Azerbaijani Karapapakhs (10%, Figure 1, Supplementary Table S3).However, its lineages differ between the two populations: N3a5a-F4205, and N3a2-M2118, both of which are common in Siberia and Central Asia, and the undifferentiated branch N-LLY22g(×M178, Y3205) occur in the Karanogais, whereas N2-Y3205 and N3a1-B211, which are typical for the populations of the Ural region and West Siberia, are found in the Karapapakhs.
Haplogroup R1a is represented largely by the R1a-M198(×M458) branch, which is very common among the Karanogais (18%) and Karapapakhs (19%) but occurs at lower frequencies in other populations of the East Caucasus analyzed in this paper (2-11%, Supplementary Table S3), except for the Tsez, Kubachi, and Tabasaran.Haplogroup R1a is widespread in Eurasia: except for the R1a-M458 branch, which is typical for the populations of Eastern Europe and rarely found in the populations of the East Caucasus, R1a comprises an astonishing variety of subclades that span a vast geography from India to Scandinavia [47].Further analysis of R1a-M198(×M458) phylogeography in the local populations of the East Caucasus and the adjacent regions will allow us to identify the sources of migration.

The Main Patterns of Geographic Variation
The frequency distribution maps reveal the patterns of geographic distribution for individual haplogroups that can be summarized for further analysis. Another cartographic method for exploring these patterns is based on the maps of Nei's genetic distances that aggregate data for all the haplogroups included in the analysis. Every map shows the degree of similarity between the studied populations (Figure 3). The maps of genetic distances expose differences between populations along the entire spectrum of the studied haplogroups, provide a general characteristic of the Y-chromosome gene pool for each population, and highlight regions with the most genetic similarity.
The Dagestani pattern. The maps of genetic distances from the Avar-Andi-Dido, Lak, Dargin, and Lezgi linguistic groups representing the Nakh-Dagestanian language family (Figure 3A-K) share a common pattern, the only exception being the Tabasaran people (Figure 3K). This Dagestani pattern is characterized by close genetic distances between the populations (0.01 < d < 0.11), occurs only in Dagestan, and is not correlated with the language. It is quite distinct although not so intense (0.07 < d < 0.17) for the Turkic-speaking Kumyks (Figure 3M). The Dagestani pattern shows a deep connection between the gene pools of Dagestan's populations in terms of Y-chromosome haplogroups. This pattern is largely defined by the presence of haplogroup J1-M267(×P58): its frequency distribution map (Figure 2D) shows its peak frequencies for most Dagestani populations. Within the scope of this study, a genetic similarity is observed between the populations of Dagestan and North Iran, which is determined by the contribution of haplogroups J1 and J2 (Figure 2D,F) and may echo the earliest waves of migration into the East Caucasus.
The Steppe pattern. The maps of Nei's genetic distances from Turkic-speaking populations demonstrate significant differences between these groups (Figure 3L-O). The Kumyk gene pool is close to that of the Caucasian-speaking populations of Dagestan (Figure 3M). The Azerbaijani gene pool (Figure 3N-O) is similar to that of Iranian-speaking peoples of the Caucasus. The only pattern that reflects a very distinct connection to the gene pools of the Eurasian steppe is found among the Karanogais (Figure 3L). This is not characteristic of other populations studied in this paper, except that there is a slight increase in the frequency of steppe-associated haplogroups among the Azerbaijani Karapapakhs. It would be fair to call it the "steppe pattern" since it reflects the genetic link to the populations of the Eurasian steppe.
The Iranian pattern. There is a genetic similarity between all Azerbaijani populations (0.03 < d < 0.21, Figure 3N-O) and the Iranian-speaking populations of the Caucasus (Tats, Talysh, Yezidi, Kurds). The contribution of haplogroups J2-M172(×M67, M12) and R1b-M269 to this pattern is the greatest, which can be seen by comparing the maps of their distribution (Figure 2E,H) and the Iranian pattern of genetic distances (Figure 3K,N,O). This pattern also manifests on the map of genetic distances from the Tabasaran, who represent the Lezgi branch of the Nakh-Dagestanian language family, although its intensity for the Tabasaran is slightly lower.
Note. The colors on the maps correspond to the values of genetic distances: the white color is the minimum difference between populations, and the richer the turquoise color, the higher the genetic distances. The color scale (matching colors to values) is shown in the upper right corner of each map.
Overall, the gene pool of the East Caucasus is represented by three main patterns: Dagestani, Iranian, and Steppe. They may correspond to three "layers" of the gene pool formed during different periods in the past. In this hypothesis, the name "Dagestani" shows a connection to the ancient autochthonous Caucasian population that gave rise to a wealth of gene pools (that still retain their unity) due to a powerful genetic drift. The Iranian pattern is associated with an ancient Iranian population descended from Media and earlier waves of migration in the 3rd-2nd millenniums BC. The Steppe pattern is more recent in terms of its origin and its contribution to the gene pool of the East Caucasus; this pattern reflects one of the latest migration waves of Turkic populations.

East Caucasus in Multidimensional Genetic Space
The cartographic analysis of genetic distances identifies the areas of genetic similarity for one population per map. In turn, the visual representation of the entire pairwise matrix of Nei's genetic distances (Table 2) by means of multidimensional scaling reveals the relationship among all studied populations. The constructed genetic space (Figure 4) comprises three different clusters that reflect geographic but not linguistic patterns. The northeastern vector of Eurasian-steppe influence manifests in the Steppe cluster. The southeastern vector reflects the influence of West Asia and shapes what we conventionally call the Iranian cluster. The vector of autochthonous Caucasian populations exerts its influence in the Dagestani cluster that unites all Dagestani-speaking peoples. The shortest distance is between the Iranian and Steppe clusters (d = 0.77). The distance between the Iranian and the Dagestani clusters is slightly longer (d = 0.93); distances between other pairs of clusters are 1.5-2 times greater (Supplementary Table S4). Similar patterns are reproduced in the analysis of principal components (Supplementary Figure S1, Supplementary Table S5).
The Steppe cluster (Table 2) reflects migrations inside the Eurasian steppe, including the medieval migration waves that impacted the gene pools of Nogais and Astrakhan Tatars, the most recent migration of Trukhmens from the Caspian region in the 17th century, etc. Genetic interactions of the past between the populations of the East Caucasus and the Eurasian steppe only show through on the northern border of the region in the gene pool of Dagestani Karanogais and are much weaker for Kumyks and Karapapakhs.
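To illustrate how a pairwise distance matrix is turned into a two-dimensional genetic space, the sketch below uses classical (Torgerson) multidimensional scaling. This is a swapped-in technique chosen for brevity: the scaling variant and settings used in Statistica are not specified in the text and may well differ. The function name and the toy distance values are hypothetical.

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Classical (Torgerson) MDS: embed points so that Euclidean distances
    approximate the entries of a symmetric distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering   # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(b)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    vals = np.clip(eigvals[order], 0.0, None)     # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(vals)

# Hypothetical distances among three populations (illustrative values only):
# populations 0 and 1 are close, population 2 is remote from both.
dist = np.array([
    [0.00, 0.10, 0.80],
    [0.10, 0.00, 0.90],
    [0.80, 0.90, 0.00],
])
coords = classical_mds(dist)
print(np.round(coords, 3))
```

Populations that are close in the distance matrix land near each other on the resulting plot, which is how clusters such as those in Figure 4 become visible.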
The Iranian cluster comprises representatives of three language families: the Turkic-speaking Azerbaijanis from Dagestan and Azerbaijan, the Iranian-speaking populations of the Caucasus, and the Lezgic-speaking Tabasaran. This cluster is much denser (d = 0.14, Table 2) than the Steppe cluster but less consolidated than the Dagestani cluster. The presence of the Tabasaran in this cluster might reflect the legacy of the Tabasaran maisum state in the south of Dagestan, where they lived next to other populations of the cluster, or, alternatively, might indicate the impact of medieval migrations from the Tabaristan province in the north of Iran. The Iranian cluster reflects interactions between the local populations and the populations of West Asia and, therefore, can be called Iranian. The populations of the East Caucasus and West Asia maintained their contacts throughout long periods of history; their contacts occurred across the vast territory that stretches beyond the East Caucasus, so further research is needed to study the sources and history of this component.
The Dagestani cluster includes all Caucasian-speaking populations of Dagestan, except for the Tabasaran, and the Turkic-speaking Kumyks of Dagestan that are geographically close to each other (the average Nei's genetic distance between all 11 populations is d = 0.05, Table 2). The manner in which the populations are distributed within the cluster remotely resembles their geography: the Dargin, Lak, and Tsez populations that live further to the north occupy the top and the left part ("the north-west") of the plot. Lezgian populations settled further to the south occupy the "south-east" of the plot, i.e., the opposite portion of the cluster. The Kumyks' position on the cluster's border represents the main layer of their gene pool that connects them to Caucasian-speaking populations and suggests genetic "borrowings" from other Turkic-speaking populations. The periphery of the plot is occupied by the pooled Nakh populations (Chechen and Ingush).
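The average distances quoted for the clusters can be illustrated with a short sketch that averages entries of a pairwise distance matrix within one cluster and between two clusters. It assumes that a cluster's density is the unweighted mean of pairwise distances among its members, which the text does not state explicitly; the function names, the cluster assignments, and the matrix values are hypothetical.

```python
import numpy as np
from itertools import combinations

def mean_within(dist, members):
    """Unweighted mean of pairwise distances among members of one cluster."""
    pairs = list(combinations(members, 2))
    return float(np.mean([dist[i, j] for i, j in pairs]))

def mean_between(dist, cluster_a, cluster_b):
    """Unweighted mean of distances between members of two clusters."""
    return float(np.mean([dist[i, j] for i in cluster_a for j in cluster_b]))

# Hypothetical distance matrix for four populations (illustrative values only).
dist = np.array([
    [0.00, 0.04, 0.80, 0.90],
    [0.04, 0.00, 0.70, 0.80],
    [0.80, 0.70, 0.00, 0.10],
    [0.90, 0.80, 0.10, 0.00],
])
cluster_a = [0, 1]   # assumed membership, for illustration
cluster_b = [2, 3]

within_a = mean_within(dist, cluster_a)
between_ab = mean_between(dist, cluster_a, cluster_b)
print(within_a, between_ab)
```

A within-cluster mean that is much smaller than the between-cluster mean corresponds to a dense, well-separated cluster.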
The geographically dense Dagestani component of the East Caucasus gene pool found in most Dagestani populations is largely shaped by the high frequency of haplogroup J1-M267(×P58); its structure will be analyzed below.
So far, subhaplogroup J1-Y3495 has not been described in detail in the literature, but its ancestral lineage J1-Z18375 (or phylogenetic equivalent Z1841) is well known [36]. We focused on its phylogenetic equivalent Z1841. For J1-Z1841/Z18375, TMRCA is 8.1 ± 1.0 kya according to YFull [https://www.yfull.com/tree/J-Z1828/, accessed on 31 May 2023] and 6.5 ± 1.5 kya according to [36]. This lineage was detected in one of the three representatives of the Kura-Araxes culture (Velikent, Derbent region, Dagestan) that dates back to the Bronze Age (~5000 years ago [48]). This led us to hypothesize its geographic origin in the Caucasus or in the vicinity of the region [36]. Its subbranch J1-Z1842 was found in a man from East Anatolia who lived ~5000 years ago [49]. The fact that the J1-Z1841/Z18375 lineage was common in Anatolia and the Levant about 3000 years ago suggests contacts between the speakers of the Hurro-Urartian and the Nakh-Dagestanian languages [36].
The only lineage that represents haplogroup J1-Z1841 in the populations of the East Caucasus is J1-Y3495. We were able to identify 17 polymorphic lineages of this haplogroup (Supplementary Table S6); frequency distribution maps were constructed for the most common of the identified variants (Figures 5A-D and S2). The maps show that lineages J1-ZS3114 and J1-CTS1460 (~6 kya, Supplementary Table S6), as well as J1-Y3495(×ZS3114, CTS1460), occur almost everywhere in the region (Figure 5B-D). There is a cumulative effect: J1-ZS3114 is more common among the speakers of the Dargin, Lak, and Lezgi branches; J1-CTS1460 is typical of the speakers of the Avar-Andi-Dido languages; J1-Y3495(×ZS3114, CTS1460) is widespread among the Dargin speakers and slightly less frequent among the speakers of Lezgi languages (Supplementary Table S6). The geographies of these three lineages overlap at a noticeable frequency (over 20%) in Dagestan, suggesting that J1-Y3495 may have originated here (Figure 5E). The average dates of origin of haplogroup J1-Y3495 and its two branches J1-ZS3114 and J1-CTS1460 differ only slightly (by 300 to 500 years), which indicates population growth in the region in the early Bronze Age (~6 kya), the time of new metalwork technologies, social change, and tribal unions.
Linguistic data suggest that the Nakh-Dagestanian protolanguage split occurred at the end of the 3rd millennium BC (~5 kya) [50]. This is consistent with our genetic data: older lineages dating back ~6000 years occur in almost all Dagestani populations analyzed in this work and in some Nakh groups (Figure 5, Supplementary Table S6), although frequency peaks for J1-CTS1460 and J1-ZS3114 are observed in a number of different linguistic groups.
Thus, the main portion of the Y gene pool of Dagestan's populations (carriers of haplogroup J1-M267(×P58)) can be traced to the ancestral lineage J1-Y3495. We hypothesize that in the late Copper Age and the early Bronze Age (~6.5 kya), carriers of this lineage were settled in the central part of mountainous Dagestan, where the geographies of the main J1-Y3495 clades overlap (Figure 5E), and were part of the community that spoke the ancestral pan-East-Caucasian language [50]. Other authors [22] came to the same conclusion based on the analysis of three marker sets (genome-wide, Y-chromosome, and mtDNA panels), although the comparison relied on STR markers and only four SNPs. The single founder of the major haplogroup in the gene pool of Dagestan may suggest a reduction of the ancient population in earlier periods, when other lineages that co-existed with J1-Y3495 became extinct because of the bottleneck effect.
The main impact of the Avar-Andi-Dido J1-CTS1460 lineage is observed for its branch J1-Y6916 (5.8 ± 1.1 kya, Supplementary Figure S2A). The gene geography of the most common variants of this lineage (Y6916(×ZS2872), Y61473, ZS2910(×ZS2878, ZS7652), and ZS2878) reveals its geographic center in the areas occupied by Avar-Andi-Dido speakers (Supplementary Figure S2B-F) and extends to some areas further to the south. The dating of the lineages (4-5 kya, Supplementary Table S6) suggests continuous population growth after the lineage split in the diverged population.

Conclusions
Three components of the East Caucasus gene pool are unequal and may have arisen in different periods of population history.
The Steppe component occurs only in the north of the region among Karanogais and reflects the most recent migration wave of Turkic-speaking nomads from the Eurasian steppe in the Middle Ages. Two other components (Dagestani and Iranian) shape two opposite poles and contribute the most to the East Caucasus gene pool.
The Dagestani component is largely shaped by lineage J1-Y3495 (6.5 ± 0.6 kya), whose emergence may have been preceded by a population decline followed by population growth. The analysis of J1-Y3495 phylogeography allowed us to trace its origin to the central part of mountainous Dagestan, to an ancestral population that spoke the pan-East-Caucasian language. The split in the gene pool after ~6 kya can be linked to population growth, the dispersal of communities, and their long-lasting isolation in the mountains. This is supported by the split of this haplogroup in Avar-Andi-Dido populations ~4-5 kya.
The Iranian component shows the genetic similarity of Azerbaijanis, the Tabasaran, and the Iranian-speaking populations of the Caucasus to the populations of West Asia, which can be linked to a series of ancient migration waves. This pattern results from the contribution of haplogroups J2-M172(×M67, M12) and R1b-M269 that may have been carried into the region by migrations of West Asian tribes, mostly Iranian speakers, throughout its history. Detailed analysis of the origin of this component and its dating will be conducted in the future in a broader genogeographic context.

Figure 4. Eastern Caucasus populations on the multidimensional scaling plot (Stress 0.07, Alienation 0.09).

The Steppe cluster is formed by the Karanogais of Dagestan and the pooled population of Turkic speakers inhabiting the north of the Caspian steppe (Stavropol Trukhmens and Nogais, Astrakhan Tatars, and Nogais). It is characterized by the highest heterogeneity (đ = 0.24, Table 2) and reflects migrations inside the Eurasian steppe, including the medieval migration waves that impacted the gene pools of Nogais and Astrakhan Tatars, the most recent migration of Trukhmens from the Caspian region in the 17th century, etc. Past genetic interactions between the populations of the East Caucasus and the Eurasian steppe are visible only on the northern border of the region, in the gene pool of Dagestani Karanogais, and are much weaker for Kumyks and Karapapakhs.

The Iranian cluster comprises representatives of three language families: the Turkic-speaking Azerbaijanis from Dagestan and Azerbaijan, the Iranian-speaking populations of the Caucasus, and the Lezgic-speaking Tabasaran. This cluster is much denser (đ = 0.14, Table 2) than the Steppe cluster but less consolidated than the Dagestani cluster. The presence of the Tabasaran in this cluster might reflect the legacy of the Tabasaran maisum state in the south of Dagestan, where they lived next to other populations of the cluster, or, alternatively, might indicate the impact of medieval migrations from the Tabaristan province in the north of Iran. The Iranian cluster reflects interactions between the local populations and the populations of West Asia and, therefore, can be called Iranian. The populations of the East Caucasus and West Asia maintained their contacts throughout long periods of history; their contacts occurred across the vast territory that stretches beyond the East Caucasus, so further research is needed to study the sources and history of this component.


Table 1. Linguistic and geographic characteristics of the studied populations and samples.