Indo-European Phylogeny
In the previous dispatch I tried to extract a Pleistocene population history signal from phonemic data for subaltern languages. Does this approach actually work? One way to check is to run the exact same algorithm for a well-known phylum. In what follows we'll extract a phylogram for Indo-European. As we shall see, there is a very strong population history signal in phonemic distance metrics.
The Indo-European family is only about five thousand years old. It spread across Western Eurasia shortly after the ethnogenesis of the Yamnaya, which was triggered by the introduction of advanced Sumerian technology, in particular the wagon and probably brewer's yeast, that made the economic exploitation of the steppe possible for the first time. (See David Anthony's excellent detective work.) We know this not only from massive whole genome studies but also from overwhelming archaeological evidence. The Yamnaya were a rank society of warrior-pastoralists much given to feasting, drinking, and all-round boisterous male bonding rituals. ("Will you fight with me brother?") Their diagnostic signature in Europe is that male warrior elites are buried with their weapons and with ornaments signifying their social status. At home in the steppe and in the more advanced Near East and later India, they can be identified by their chariots. Extraordinarily, there were major chariot manufacturing centers in the steppe that supplied the chariot civilizations of the Near East in the second millennium BCE, piggy-backing on the brisk horse trade. More on that later. Let's not get tied down by the extraordinary discoveries of prehistorians.
We begin by testing whether Indo-European phonemic diversity is characterized by isolation-by-distance. The Mantel test statistic for pairwise phonemic and geographic distances equals 0.274. The probability of seeing this by chance is less than one in 100,000 ($latex p<10^{-5}$). So there is very good reason to think that phonemic distances contain a strong population history signal.
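For readers who want to see what a Mantel test does mechanically, here is a minimal sketch in plain Python. It is not the code behind the numbers above. It assumes the statistic is the Pearson correlation between the upper triangles of two symmetric distance matrices, and that significance is judged by shuffling the language labels of one matrix. The function names (`pearson`, `upper_triangle`, `mantel`) and the default permutation count are my own illustrative choices.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def upper_triangle(d, order=None):
    """Flatten the strict upper triangle of a square matrix.

    If `order` is given, rows and columns are relabelled together,
    which is how the Mantel permutation is applied.
    """
    n = len(d)
    if order is None:
        order = list(range(n))
    return [d[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel(d_a, d_b, permutations=999, seed=0):
    """One-sided Mantel test (hypothetical helper, not the post's code).

    Returns the observed correlation and a permutation p-value for the
    hypothesis that larger d_a distances go with larger d_b distances.
    """
    rng = random.Random(seed)
    flat_b = upper_triangle(d_b)
    observed = pearson(upper_triangle(d_a), flat_b)
    order = list(range(len(d_a)))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)  # relabel the languages in d_a only
        if pearson(upper_triangle(d_a, order), flat_b) >= observed:
            hits += 1
    # The +1 counts the observed arrangement itself.
    p_value = (hits + 1) / (permutations + 1)
    return observed, p_value
```

Note that the smallest p-value this procedure can return is 1/(permutations + 1), so a permutation-based claim of p below one in 100,000 would need at least 100,000 shuffles.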
Figure 1 displays a reduced-form version of the full phylogram. What I have done is fold the subtrees corresponding to the main subfamilies. For example, the "North Indian" branch at the very top of the phylogram is a clade with 17 languages that are closely related and geographically centered in north India (Hindi, Gujarati, Marathi, and so on). The "NW Indian" branch below it is a clade with 13 languages that cluster around Pakistan and the Indian northwest (Punjabi, Baluchi, Sindhi, and so on). That the algorithm correctly classifies these languages (and other subfamilies) together is reassuring.
Figure 1. Top level breakdown of the Indo-European family.
What the phylogram shows is that the split with the greatest time depth is the East-West split between the Indian and Western branches. This is as it should be, since we know the Yamnaya migration pulses east and west were nearly contemporaneous and took place almost immediately after the ethnogenesis of the Yamnaya.
Figure 2. Yamnaya expansion 5ka. Source: Reich (2018).
So the phylogram is successful at discerning the population history and branching of the Indo-European family both at the fine scale and at the top-level splits with the greatest time depth. Where it fails badly is in assigning Latin and Sri Lankan together with the Swedish-Norwegian clade. The suggested close relationship between Breton and ancient Zoroastrian is less than persuasive. Modulo these anomalies, the phylogram is extremely compelling. Figure 3 displays the beast in all its glory. Study it at length and it quickly becomes apparent that the accuracy of the phylogram is simply astonishing.
Figure 3. Indo-European phylogram based on phonemic distance.
So, yes, there is a very strong population history signal in phonemic distances. We are not deluding ourselves that we can recover Pleistocene population history from subaltern languages. The issue is whether the phylograms thus derived are consistent with those obtained from physical anthropology and paleontology. As I mentioned in the previous dispatch, the disconnect between the two is the most important outstanding puzzle of modern research into our origins.
On a separate note, what emerges from this analysis is the extraordinary position of shatter-belts. Language isolates packed into New Guinea (as we saw in the previous post), Iran, Afghanistan, etc. attest to the history of subaltern populations that were once widespread but have since been pushed into marginal zones. Scott made a compelling case in The Art of Not Being Governed that the ethnolinguistic fragmentation of mainland Southeast Asia reflects centuries of attempts at state formation and resistance from below. That is plausible for Iran and Afghanistan (and certainly the Balkans). But it is not a plausible explanation for the extraordinary ethnolinguistic fragmentation found in New Guinea. What is?