Saturday, August 7, 2010

Geographic dispersion

Some languages from AUTOTYP, colored by areal grouping.
The AUTOTYP project offers a freely downloadable CSV file containing genealogical and geographical data on ~2,700 languages. The geographical data include a division of languages into areal groups according to a system of continental and sub-continental linguistic regions, as well as geographic coordinates (latitude and longitude) that roughly approximate "what is generally considered to be the sociolinguistic center of the language," according to the manual.

It would be nice to know how far apart a given pair of languages is, in terms of linear distance (kilometers) along the surface of the earth. Converting geographic coordinates into surface distance turns out to be slightly nontrivial, but this page has the scoop. Here's my translation into Python, using the spherical law of cosines:
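import numpy as np

def geo_dist(latlng0, latlng1):
    """
    Given two (lat, lng) points in degrees, returns the shortest distance
    between them (in kilometers) along a great circle. Assumes spherical earth.
    """
    R = 6371  # earth's mean radius in km
    lat0, lng0 = np.deg2rad(latlng0)
    lat1, lng1 = np.deg2rad(latlng1)

    # spherical law of cosines: cosine of the central angle between the points
    cos_angle = (np.sin(lat0) * np.sin(lat1) +
                 np.cos(lat0) * np.cos(lat1) * np.cos(lng1 - lng0))
    # rounding can push the value just outside [-1, 1], where arccos gives nan
    return np.arccos(np.clip(cos_angle, -1.0, 1.0)) * R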
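As a cross-check, here is a minimal sketch of the same great-circle distance using the haversine formula, which is often preferred for nearby points because the arccos form loses precision when the central angle is tiny. The name `haversine_km`, the constant `EARTH_RADIUS_KM`, and the sample coordinates are illustrative assumptions, not part of AUTOTYP or of the function above.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean radius; spherical-earth assumption

def haversine_km(latlng0, latlng1):
    """Great-circle distance in km between two (lat, lng) points in degrees."""
    lat0, lng0 = np.deg2rad(latlng0)
    lat1, lng1 = np.deg2rad(latlng1)
    a = (np.sin((lat1 - lat0) / 2) ** 2 +
         np.cos(lat0) * np.cos(lat1) * np.sin((lng1 - lng0) / 2) ** 2)
    # clip guards against rounding error before the square root and arcsin
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

# a quarter of the equator: (0, 0) to (0, 90)
quarter = haversine_km((0.0, 0.0), (0.0, 90.0))
# identical points give zero rather than nan
same = haversine_km((52.5, 13.4), (52.5, 13.4))
print(quarter, same)
```

On a spherical earth the two formulas should agree up to floating-point error, so either can serve as a sanity check on the other.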
I got to wondering what use these fine-grained geographic data could have when making regression models of some typological feature's distribution, with a particular eye toward controlling for the possible effects of genetic and areal relatedness among languages. One thought is that a group of languages (say, a genetic stock) might be expected to exhibit more internal variance in features if its members are more spread out from each other geographically. The theory would be that geographically dispersed languages are effectively more isolated from each other, while more concentrated groups of languages have a greater degree of contact and are likely to have more things in common.

We might measure a group's geographical dispersion as something like overall variance in location. The variance of a set of scalars is the average of their squared differences from their mean, so the analogous quantity in our case could be the average distance of a language from the geographic center of the group. There's no need to square, since distance is never negative; strictly speaking, this makes the measure an analogue of mean absolute deviation rather than of variance.

Finding the midpoint of a group of points on the surface of a sphere is an interesting little problem, and a nice solution is described here. The idea is that you transform the geographic coordinates into points in 3-dimensional space, with the center of the sphere as the origin. You then find the centroid of those x,y,z-points (which is just their component-wise average). This central point will lie beneath the surface of the sphere, but its projection onto the surface corresponds to the geographic midpoint we want. Here's the Python code:
def geo_midpoint(latlng_points, weights=None):
    """
    Given a list of (lat,lng) points, return the (lat,lng) of the geographic
    center of those points, possibly weighted by the optional vector of weights
    (len(weights) must equal len(latlng_points)). Assumes a spherical earth.
    """
    # convert from degrees to rads
    latlng_points = [(np.deg2rad(lat), np.deg2rad(lng)) for (lat, lng) in
            latlng_points]

    # convert lat,lng points to x,y,z coords
    xyz_points = np.array([(np.cos(lat) * np.cos(lng), np.cos(lat) * np.sin(lng),
        np.sin(lat)) for (lat, lng) in latlng_points])

    # find weighted average of the xyz points
    xyz_avg = np.average(xyz_points, axis=0, weights=weights)

    # project xyz_avg back to latitude/longitude
    x, y, z = xyz_avg
    lng = np.arctan2(y, x)
    hyp = np.sqrt(x**2 + y**2)
    lat = np.arctan2(z, hyp)

    # convert back to degrees
    return (np.rad2deg(lat), np.rad2deg(lng))
I've written this to take an optional weights argument, which makes the center a weighted average, so that different points can exert different "pulls" on where the midpoint is. For example, languages might be weighted by the size of their speaker populations, giving some sort of "center of speaker mass".
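To see the effect of the weights, and why averaging in x,y,z space beats averaging the raw coordinates, here is a small self-contained sketch. `sphere_midpoint` is a hypothetical compact restatement of the midpoint computation (so the snippet runs on its own), and the points and weights are made up for illustration.

```python
import numpy as np

def sphere_midpoint(latlng_points, weights=None):
    """Compact vectorized version of the midpoint computation (degrees in/out)."""
    lat, lng = np.deg2rad(np.asarray(latlng_points, dtype=float)).T
    # unit vectors from the sphere's center to each point
    xyz = np.column_stack([np.cos(lat) * np.cos(lng),
                           np.cos(lat) * np.sin(lng),
                           np.sin(lat)])
    x, y, z = np.average(xyz, axis=0, weights=weights)
    # project the (possibly weighted) centroid back onto the surface
    return (np.rad2deg(np.arctan2(z, np.hypot(x, y))),
            np.rad2deg(np.arctan2(y, x)))

# two points on the equator, 90 degrees apart
points = [(0.0, 0.0), (0.0, 90.0)]
unweighted = sphere_midpoint(points)               # halfway between them
weighted = sphere_midpoint(points, weights=[3, 1])  # pulled toward the first

# points straddling the antimeridian: a naive mean of longitudes would be 0
wrapped = sphere_midpoint([(0.0, 170.0), (0.0, -170.0)])
print(unweighted, weighted, wrapped)
```

The antimeridian case is the main reason for the detour through three dimensions: the two points are only 20 degrees apart, and the projected centroid lands between them at longitude 180 rather than on the opposite side of the globe.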

It's now a pretty easy matter to compute our dispersion value:
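def geo_dispersion(latlng_points, weights=None):
    """
    Given a list of (lat, lng) points, returns (d, s), where d is the mean
    distance (km) from the geographic midpoint and s = d / sqrt(n) is a rough
    scale for the uncertainty of that mean. Note that s is not the textbook
    standard error, which would use the standard deviation of the distances.
    """
    midpoint = geo_midpoint(latlng_points, weights=weights)
    d = np.mean([geo_dist(latlng, midpoint) for latlng in latlng_points])
    if np.isnan(d):
        # distance undefined (geo_dist returned nan): flag the group as missing
        return 'NA', 'NA'
    if d < 1.:
        # treat sub-kilometer dispersion (e.g. a single-language group) as zero
        return 0., 0.
    s = d / np.sqrt(len(latlng_points))
    return d, s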
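One caveat about the second return value: the code computes s as d / sqrt(n), whereas the conventional standard error of a mean divides the sample standard deviation of the values by sqrt(n). A minimal sketch of the conventional version follows; `mean_and_stderr` is a hypothetical helper that takes the distances directly, and the numbers are made up for illustration.

```python
import numpy as np

def mean_and_stderr(distances):
    """Return (mean, standard error of the mean) for a 1-D set of distances."""
    distances = np.asarray(distances, dtype=float)
    n = distances.size
    mean = distances.mean()
    if n < 2:
        return mean, float('nan')  # spread is undefined for a single value
    # sample standard deviation (ddof=1) divided by sqrt(n)
    return mean, distances.std(ddof=1) / np.sqrt(n)

# made-up distances (km) from a group's midpoint, for illustration only
toy = [100.0, 200.0, 300.0]
m, se = mean_and_stderr(toy)
print(m, se)
```

The two quantities coincide only when the standard deviation of the distances happens to equal their mean, so error bars built from them can differ noticeably.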
I computed these (unweighted) dispersion values by continental group (10 groups), by sub-continental group (24 groups), and by genetic stock (169 stocks with > 1 language).

Whisker plots of geographic dispersion by (from left to right): continent, sub-continent, genetic stock (20 largest stocks):



There is considerable variation in geographic dispersion across language groups, at all three levels. One of my first questions was whether dispersion is just measuring group size. It doesn't have to be, since it's averaged over (i.e., divided by the number of) the languages in the group; it's kilometers from center per language. So it controls for group size in that sense. But we might still expect larger groups to have higher dispersion, because dispersion is basically a measure of how much geographic space a group occupies, and it's reasonable to suppose that more languages means more space occupied. A stock containing 100 languages probably covers a lot more ground than a stock of just 2 languages. Indeed, there is a positive correlation between group size and group dispersion, though it's rather weak for the areal groups: r = .21 for the continental grouping, and r = .29 for the sub-continental grouping. For genetic stocks (with at least 2 languages), on the other hand, the correlation is much higher: r = .68. In fact there seems to be some kind of power-law relationship between stock size and stock dispersion, which would show up as a roughly straight line when both axes are logarithmic; here's a plot on logarithmic axes (the correlation between logarithms is r = .69):
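For readers who want to reproduce this kind of comparison, here is a minimal sketch of computing the correlation on raw and on log-log scales. The function name `size_dispersion_correlations` and the toy values are assumptions made for illustration; they are not the AUTOTYP figures reported above. Groups with zero dispersion have to be dropped before taking logarithms.

```python
import numpy as np

def size_dispersion_correlations(sizes, dispersions):
    """Pearson r between size and dispersion, on raw and log-log scales."""
    sizes = np.asarray(sizes, dtype=float)
    dispersions = np.asarray(dispersions, dtype=float)
    r_raw = np.corrcoef(sizes, dispersions)[0, 1]
    # logs are only defined for strictly positive values
    keep = (sizes > 0) & (dispersions > 0)
    r_log = np.corrcoef(np.log(sizes[keep]), np.log(dispersions[keep]))[0, 1]
    return r_raw, r_log

# made-up values following dispersion = 50 * size**0.5, for illustration only
toy_sizes = np.array([2, 5, 10, 50, 100, 400])
toy_disp = 50.0 * toy_sizes ** 0.5
r_raw, r_log = size_dispersion_correlations(toy_sizes, toy_disp)
print(r_raw, r_log)
```

With data that follow an exact power law, the log-log correlation is essentially 1 while the raw correlation is lower, which is the pattern a straight line on logarithmic axes would suggest.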


So it's an interesting question whether geographic dispersion is going to be any more useful to us as a covariate than simple group size, at least in the case of language stocks. Dispersion does have the advantage of being continuous, so it should give more fine-grained information about the group than just its size. My original motivation was the hypothesis that geographically dispersed groups of languages should exhibit more typological variability than geographically concentrated groups, and in retrospect, the analogous hypothesis for group size sounds equally plausible (though probably with a different causal story): large language groups should exhibit more typological variability than small groups. In any case, the thing to do now is test both hypotheses on some typological variables.

Since the necessary data are already lying around here at Typfreq Suites, why not start with population size? Stay tuned.
