Phonemic Phylograms for Subaltern Languages
The picture of population history that emerges from physical anthropology (molecular, craniometric, craniodental, etc.) is consistent with an Out-of-Africa model of the ethnographic and historic present. All non-African populations descend from anatomically modern (Homo sapiens sensu stricto) founder populations that dispersed from Africa 130-40ka. The evidence from paleontology (fossil molecular, craniometric, craniodental, etc.) complicates this picture. There were evidently successful interspecific families (if Sapiens, Neanderthals, Denisovans, etc., are regarded as species) or interracial families (if they are regarded as allopatric subspecies): the DNA of non-African people can only contain sequences acquired from them if their issue survived to procreate. Put another way, the lineages of the other taxa encountered by Sapiens cannot all have vanished without issue, for otherwise their signature could not still be found in contemporary populations. [So these were at best allospecies.]
The picture is complicated further when we look at the archaeological evidence. The evidence is inconsistent with Out-of-Africa in a thick sense, namely that Sapiens replaced Neanderthals sensu lato because they were behaviorally modern and the latter were not. There is no evidence of modern behavior in the assemblages associated with anatomically modern humans within and without Africa for tens of thousands of years after the "speciation" of Sapiens and the subsequent Out-of-Africa dispersals. The onset of modern behavior is late, staggered and impossible to reconcile with the simple Human Revolution story laid out by Klein, Mellars, Stringer, Gamble, Tattersall and others from the late 1980s onwards. (Although Gamble seems to have since changed his mind.)
A further source of evidence is linguistic. Cavalli-Sforza and others began to show in the 1990s that linguistic phylograms resembled those derived from physical anthropology. Atkinson remade the case in the last decade. That case has since been debunked, among others by Creanza et al. (2015). Phylograms extracted from linguistic data are not consistent with phylograms obtained from physical anthropology. Why should that be? Something very interesting is going on with this disconnect.
The problem with phonemic data is not, as we shall see, that the population history signal is confounded by unstable rates of innovation. The problem is rather that the Holocene Filter confounds the Pleistocene population history signal. Most people on the planet today speak languages in a small handful of families (Indo-European, Sino-Tibetan, Bantu, Nilo-Saharan, Dravidian, Austronesian, etc.) that underwent massive geographic and population size expansions during the Holocene. The ethnogenesis of these families was triggered by the agricultural (9ka) and pastoral (5ka) revolutions. Contemporary populations in Eurasia, Africa and elsewhere, as well as those of the ethnographic present, descend from very recent migrants (the men more so than the women since the migrations were always sex-biased) superimposed over yet older strata of Holocene expanders. Underneath these massive boulders are ancient populations of hunter-gatherers like the San, the Andamanese and thousands of others who survive as isolates in deserts, on islands, in dense forests, in mountain redoubts, and suchlike. If we are interested in Pleistocene history, we need to isolate and study the substratum of hunter-gatherer populations of the ethnographic present. Here we make such an attempt.
In order to recover Pleistocene population history we have to figure out a way of controlling for the Holocene filter. I think there is a simple method that can work if we have a large enough phonemic database. Populations (roughly identified as the ancestors of the speakers of language families) that underwent Holocene expansions can be expected to be large today for that very reason. This means that if we throw out all the big language families, our sample will become more representative of the ancient substrata.
We examine the phonemic data collated by Creanza et al. (2015) from the Ethnologue Database. After throwing out language families with speaker populations larger than those of the Hmong-Mien family (who were largely overrun and driven out of China and into the shatter-belt of Indochina by the Han) and New World populations (who are known to have reached the New World after the Last Glacial Maximum), we obtain a sample of 103 languages that have been classified into 19 language phyla by linguists. These have a good claim to be direct descendants of languages spoken by Pleistocene populations in the Old World and Sahul before the Mesolithic (New World) and Holocene expansions. Figure 1 displays the latitude and longitude of these languages.
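The filtering step can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the code used for this post: the `Language` record, the `holocene_filter` function, the family names and every number below are placeholders invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Language:
    iso: str         # language code (placeholder values below)
    family: str      # family or phylum label
    new_world: bool  # True if the language is spoken in the Americas


def holocene_filter(languages, family_speakers, max_speakers):
    """Keep non-New-World languages whose family is no larger than the cutoff."""
    return [
        lang
        for lang in languages
        if not lang.new_world and family_speakers[lang.family] <= max_speakers
    ]


# Invented records and speaker counts, for illustration only.
languages = [
    Language("aaa", "FamilyA", False),
    Language("bbb", "FamilyB", False),
    Language("ccc", "FamilyC", True),
    Language("ddd", "FamilyA", False),
]
family_speakers = {"FamilyA": 50_000, "FamilyB": 400_000_000, "FamilyC": 20_000}
cutoff = 10_000_000  # stands in for the speaker population of the cutoff family

kept = holocene_filter(languages, family_speakers, cutoff)
print([lang.iso for lang in kept])
```

The comparison is `<=` so that the cutoff family itself stays in the sample, which matches the fact that Hmong languages appear in the phylograms.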
The dataset consists of Boolean variables denoting the presence or absence of 728 phonemes (vowels and consonants). Phonemic distance between two languages is computed as the number of phonemes present in one language but absent from the other (the Hamming metric).
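To make the metric concrete, here is a minimal sketch of the Hamming distance on presence/absence vectors. The three toy inventories over five unnamed phonemes are invented for the example (the real vectors have 728 entries), and the function names are my own.

```python
def hamming_distance(a, b):
    """Count the positions at which two presence/absence vectors differ."""
    if len(a) != len(b):
        raise ValueError("inventories must cover the same set of phonemes")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(inventories):
    """Symmetric matrix of pairwise Hamming distances."""
    n = len(inventories)
    return [
        [hamming_distance(inventories[i], inventories[j]) for j in range(n)]
        for i in range(n)
    ]


# Toy inventories: each position marks one phoneme as present or absent.
inventories = [
    [True, True, False, False, True],
    [True, False, False, True, True],
    [False, False, True, True, False],
]

D = distance_matrix(inventories)
for row in D:
    print(row)
```

Note that a phoneme absent from both languages contributes nothing to the distance, just like one present in both; only mismatches count.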
We begin by testing whether isolation-by-distance explains pairwise phonemic distance. The appropriate test is a Mantel test comparing geographic distance (computed by the Haversine formula using known waypoints) and phonemic distance. We obtain a robust test statistic equal to 0.585. The probability of observing this value by chance is less than one in a million ($latex p<10^{-6}$). At the level of language phyla (Creanza et al. report the highest language phylum for each language, not family per se), the Mantel test statistic is a still robust 0.350 and statistically significant (p=0.041). So phonemic distance displays the same isolation-by-distance as physical anthropology distances. We can thus be confident that phonemic distance contains a population history signal.
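For readers who want to see the mechanics, the following is a rough sketch of a Haversine distance and a permutation-based Mantel test in plain Python. Several things here are assumptions on my part rather than a record of the computation reported above: the statistic is taken to be the Pearson correlation between the two sets of pairwise distances, the test is one-sided, the Earth radius is the usual mean value of 6371 km, and the coordinates are made-up points.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed value)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def upper_triangle(m, order=None):
    """Flatten the entries above the diagonal, optionally relabelling rows/columns."""
    n = len(m)
    idx = list(range(n)) if order is None else order
    return [m[idx[i]][idx[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mantel_test(geo, phon, permutations=999, seed=0):
    """Return (r, p): observed correlation and one-sided permutation p-value."""
    rng = random.Random(seed)
    g = upper_triangle(geo)
    r_obs = pearson(g, upper_triangle(phon))
    order = list(range(len(geo)))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)  # relabel the languages in one matrix only
        if pearson(g, upper_triangle(phon, order)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)


# Made-up coordinates (latitude, longitude) for five hypothetical languages.
coords = [(-20.0, 20.0), (12.0, 93.0), (-5.0, 141.0), (-25.0, 134.0), (43.0, -2.0)]
geo = [[haversine_km(*p, *q) for q in coords] for p in coords]
# A toy phonemic matrix that grows with geography, so the correlation is perfect.
phon = [[d / 100.0 for d in row] for row in geo]

r, p = mantel_test(geo, phon)
print(round(r, 3), p)
```

Rows and columns are permuted together because the entries of a distance matrix are not independent, which is why an ordinary correlation test on the pairwise distances would overstate significance. The smallest p-value this procedure can report is 1/(permutations + 1), so a claim like one in a million requires at least a million permutations.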
Figure 2 presents the phylogram (lineage tree) obtained at the level of language phyla as the languages are classified today. We derive the phylogenetic tree using the sequential neighbor-joining algorithm.
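As a sketch of what neighbor joining does, here is a compact pure-Python version of the standard Saitou and Nei procedure, which returns the tree as a Newick string. I am assuming this textbook variant is what is meant by "sequential" neighbor joining; the function name is my own, and the five-taxon distance matrix is a small textbook-style example, not phonemic data.

```python
def neighbor_joining(names, dist):
    """Build a neighbor-joining tree; return it as a Newick string.

    names: unique leaf labels; dist: symmetric matrix of pairwise distances.
    """
    labels = list(names)
    d = {a: {b: float(dist[i][j]) for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    while len(labels) > 2:
        n = len(labels)
        # Total distance from each node to all the others.
        r = {a: sum(d[a][b] for b in labels) for a in labels}
        # Pick the pair that minimises the Q criterion.
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                a, b = labels[i], labels[j]
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from the joined pair to their new parent node.
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        parent = f"({a}:{len_a:.3f},{b}:{len_b:.3f})"
        # Distances from the new parent to every remaining node.
        d[parent] = {parent: 0.0}
        for c in labels:
            if c not in (a, b):
                d[parent][c] = d[c][parent] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        labels = [c for c in labels if c not in (a, b)] + [parent]
    a, b = labels
    return f"({a}:{d[a][b]:.3f},{b});"


# Small illustrative matrix over five taxa (not phonemic data).
names = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

tree = neighbor_joining(names, dist)
print(tree)
```

In this example a and b are joined first, and d and e end up as sister leaves. The resulting tree is unrooted; the outermost parentheses in the Newick string are just a place to hang the last branch.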
Figure 2. Phylogram based on phonemic distance. Source: Creanza et al. (2015), Ethnologue Database, author's computations.
What stands out is the isolation of the Hmong, the Khoisan and the Indian Aborigines (whose location places them in Jim Corbett National Park in the foothills of the Himalayas). Interestingly, the Australians, with their unusual languages (many lack a fixed word order), are placed next to the northeastern Siberians, the Chukotko-Kamchatkan language family. Andamanese is closest to West Papuan, suggesting that this subaltern population was once widespread across the southern dispersal route.
Can we recover the language classification (i.e., the "families") by looking at phonemic distance at the level of languages? Figure 3 displays the phylogram for individual languages. For each language, we display the family, the ISO code for the language, and the latitude and longitude in parentheses.
Figure 3. Phylogram based on phonemic distance. Source: Creanza et al. (2015), Ethnologue Database, author's computations.
The first, and most reassuring, thing to note is that the language families as identified by linguists are largely placed together. The Australians have a complicated structure but they are classified together (the bottom half of the phylogram above the Hmong). Ditto the Hmong and the Khoisan. The fact that recognized phyla are generally classified together is very strong evidence of a population history signal in phonemic data.
The big anomaly is the classification of New Guinea languages. Unlike Australian, Khoisan, and Hmong languages, the Trans-New Guinea phylum does not cluster together. Rather, there seem to be multiple meaningful clusters of languages that have been classified as falling within the Trans-New Guinea phylum. This is quite possibly due to the fact that New Guinea served as the shatter-belt par excellence in our deep history. What this phylogram suggests is that scholars may have misclassified New Guinea languages, in particular by not recognizing enough language families on this extraordinarily diverse island.
In Figure 4 we have folded up the tree to reveal the underlying large-scale relationships together with the great anomalies. The collapsed subtrees and branch nodes are labeled in blue. Those that are not labeled "Branch #" are subtrees with many languages in the same family underneath. The anomalies are interesting. Why is the language of Indian aborigines close to Khoisan languages in Africa? Why is one Andamanese language classified with Sahul languages and another with Basque in Europe and the Hmong in China? Even more intriguing is the phylogenetic affinity (surely spurious) between the Hmong and Australian families. Some of these anomalies are probably random. But others correspond to actual population history. Recall that the speakers of the predecessors of these languages probably occupied much larger areas than they do at present. The most striking feature of course is the phonemic incoherence of languages folded into the Trans-New Guinea phylum. Above all else, what it attests to is the sheer variation in New Guinea. Could the island have served as a shatter-belt in the Pleistocene?
Figure 4. Manually folded version of Figure 3. Only subtrees with languages from the same phylum have been folded.
This is a work in progress. My goal is to integrate the information from physical anthropology, paleontology, archaeology and linguistics to tell a more compelling history of the Pleistocene than has so far been on offer. Bear with me.