Some languages from Autotyp, colored by areal grouping.
It would be nice to know how far apart a given pair of languages is, in terms of linear distance (kilometers) along the surface of the earth. Getting from a pair of geographic coordinates to a linear distance turns out to be slightly nontrivial, but this page has the scoop. Here's my translation into Python:
    import numpy as np

    def geo_dist(latlng0, latlng1):
        """
        Given two (lat, lng) points, returns the shortest distance between
        them (in kilometers) along a great circle. Assumes spherical earth.
        """
        R = 6371  # earth's radius in km
        lat0, lng0 = np.deg2rad(latlng0)
        lat1, lng1 = np.deg2rad(latlng1)
        # spherical law of cosines; the clip guards against rounding error
        # pushing the cosine just outside [-1, 1], where arccos returns NaN
        cos_angle = (np.sin(lat0) * np.sin(lat1) +
                     np.cos(lat0) * np.cos(lat1) * np.cos(lng1 - lng0))
        d = np.arccos(np.clip(cos_angle, -1.0, 1.0)) * R
        return d

I got to wondering what use this fine-grained geographic data could have when making regression models of some typological feature's distribution, with a particular eye toward controlling for the possible effects of genetic and areal relatedness among languages. One thought is that a group of languages (say, a genetic stock) might be expected to exhibit more internal variance in features if its members are more spread out from each other geographically. The theory would be that geographically dispersed languages are effectively more isolated from each other, while more concentrated groups of languages have a greater degree of contact and are likely to have more things in common.
We might measure a group's geographical dispersion as something like overall variance in location. The variance of a set of scalars is the average of their squared differences from their mutual mean, so the analogous quantity in our case could be the average distance (no need to square if distance is always positive) of a language from the geographic center of the group.
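Written out, the analogy looks like this. The notation is mine rather than taken from any source: the x_i are scalars with mean x-bar, the p_i are the language locations, m is their geographic midpoint, and d is great-circle distance in kilometers.

```latex
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2
\qquad\longrightarrow\qquad
D = \frac{1}{n}\sum_{i=1}^{n} d(p_i, m)
```

Strictly speaking, D is closer to a mean absolute deviation than to a variance, so it comes out in kilometers rather than squared kilometers.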
Finding the midpoint of a group of points on the surface of a sphere is an interesting little problem, and a nice solution is described here. The idea is that you transform the geographic coordinates into points in 3-dimensional space, with the center of the sphere as the origin. You then find the centroid of those x,y,z-points (which is just their component-wise average). This central point will lie beneath the surface of the sphere, but its projection onto the surface corresponds to the geographic midpoint we want. Here's the Python code:
    def geo_midpoint(latlng_points, weights=None):
        """
        Given a list of (lat,lng) points, return the (lat,lng) of the
        geographic center of those points, possibly weighted by the
        optional vector of weights (len(weights) must equal
        len(latlng_points)). Assumes a spherical earth.
        """
        # convert from degrees to rads
        latlng_points = [(np.deg2rad(lat), np.deg2rad(lng))
                         for (lat, lng) in latlng_points]
        # convert lat,lng points to x,y,z coords
        xyz_points = np.array([(np.cos(lat) * np.cos(lng),
                                np.cos(lat) * np.sin(lng),
                                np.sin(lat))
                               for (lat, lng) in latlng_points])
        # find weighted average of the xyz points
        xyz_avg = np.average(xyz_points, axis=0, weights=weights)
        # project xyz_avg back to latitude/longitude
        x, y, z = xyz_avg
        lng = np.arctan2(y, x)
        hyp = np.sqrt(x**2 + y**2)
        lat = np.arctan2(z, hyp)
        # convert back to degrees
        return (np.rad2deg(lat), np.rad2deg(lng))

I've written this to take an optional weights argument, which makes the center a weighted average, so that different points can exert different "pulls" on where the midpoint is. For example, languages might be weighted by the size of their speaker populations, giving some sort of "center of speaker mass".
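To see the "pull" in action, here is a small self-contained sketch. The function `weighted_midpoint` is a compact restatement of the recipe above under a name I made up so the snippet runs on its own, and the two points and their speaker counts are invented for illustration.

```python
import numpy as np

def weighted_midpoint(latlng_points, weights=None):
    """Hypothetical compact version of the midpoint recipe (spherical earth)."""
    # split the (n, 2) array of degrees into latitude and longitude in radians
    lat, lng = np.deg2rad(np.asarray(latlng_points, dtype=float)).T
    # unit vectors from the center of the sphere to each point
    xyz = np.column_stack((np.cos(lat) * np.cos(lng),
                           np.cos(lat) * np.sin(lng),
                           np.sin(lat)))
    # (weighted) centroid, then project back to latitude/longitude
    x, y, z = np.average(xyz, axis=0, weights=weights)
    return (np.rad2deg(np.arctan2(z, np.hypot(x, y))),
            np.rad2deg(np.arctan2(y, x)))

# Two made-up points on the equator, 90 degrees of longitude apart.
points = [(0.0, 0.0), (0.0, 90.0)]

unweighted = weighted_midpoint(points)
# Invented speaker counts: the first language has three times the speakers.
weighted = weighted_midpoint(points, weights=[3000, 1000])

print("unweighted midpoint:", unweighted)
print("weighted midpoint:  ", weighted)
```

The unweighted midpoint sits halfway between the two points, while the weighted one slides toward the first point. Note that the shift is not linear in the weights, because the averaging happens in x,y,z space before projecting back onto the sphere.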
It's now a pretty easy matter to compute our dispersion value:
    def geo_dispersion(latlng_points, weights=None):
        """
        Given a list of (lat, lng) points, returns (d, s), where d is the
        mean distance (in km) from the geographic midpoint, and s is the
        standard error of that mean. The weights affect only the midpoint.
        """
        midpoint = geo_midpoint(latlng_points, weights=weights)
        dists = [geo_dist(latlng, midpoint) for latlng in latlng_points]
        d = np.mean(dists)
        if np.isnan(d):
            return 'NA', 'NA'
        elif d < 1.:
            # treat sub-kilometer dispersion as zero
            return 0., 0.
        # standard error of the mean: sample std of the distances / sqrt(n)
        s = np.std(dists, ddof=1) / np.sqrt(len(dists))
        return d, s

I computed these (unweighted) dispersion values by continental group (10 groups), by sub-continental group (24 groups), and by genetic stock (169 stocks with more than one language).
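The per-group computation might look something like the sketch below. It is self-contained, so it uses its own helper, `mean_dist_from_midpoint` (a name I made up), which relies on the fact that the great-circle angle between two points is the arccosine of the dot product of their unit vectors. The language records are invented placeholders, not Autotyp data.

```python
import numpy as np
from collections import defaultdict

R_KM = 6371.0  # mean earth radius; spherical approximation

def mean_dist_from_midpoint(latlng_points):
    """Mean great-circle distance (km) of points from their midpoint."""
    lat, lng = np.deg2rad(np.asarray(latlng_points, dtype=float)).T
    xyz = np.column_stack((np.cos(lat) * np.cos(lng),
                           np.cos(lat) * np.sin(lng),
                           np.sin(lat)))
    center = xyz.mean(axis=0)
    # project the centroid onto the unit sphere (assumes it is not the
    # origin, which would happen e.g. for two exactly antipodal points)
    center /= np.linalg.norm(center)
    # clip guards against rounding error before taking arccos
    cos_angle = np.clip(xyz @ center, -1.0, 1.0)
    return float(np.mean(np.arccos(cos_angle)) * R_KM)

# Invented records: (language, stock, latitude, longitude)
languages = [
    ("lang_a", "stock_1", 10.0, 20.0),
    ("lang_b", "stock_1", 12.0, 22.0),
    ("lang_c", "stock_1", 11.0, 18.0),
    ("lang_d", "stock_2", -5.0, 100.0),
    ("lang_e", "stock_2", 40.0, 140.0),
    ("lang_f", "stock_3", 60.0, 10.0),   # singleton stock, skipped below
]

# collect coordinates by stock
by_stock = defaultdict(list)
for name, stock, lat, lng in languages:
    by_stock[stock].append((lat, lng))

# dispersion only for stocks with more than one language
dispersion = {stock: mean_dist_from_midpoint(pts)
              for stock, pts in by_stock.items() if len(pts) > 1}

for stock, d in sorted(dispersion.items()):
    print(f"{stock}: n={len(by_stock[stock])}, dispersion={d:.0f} km")
```

Swapping the stock label for a continental or sub-continental label in the grouping step would give the other two breakdowns.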
So it's an interesting question whether geographic dispersion is going to be any more useful to us as a covariate than simple group size, at least in the case of language stocks. Dispersion does have the advantage of being continuous, so it should give more fine-grained information about the group than just its size. My original motivation was the hypothesis that geographically dispersed groups of languages should exhibit more typological variability than geographically concentrated groups, and in retrospect, the analogous hypothesis for group size sounds equally plausible (though probably with a different causal story): large language groups should exhibit more typological variability than small groups. In any case, the thing to do now is test both hypotheses on some typological variables.
Since the necessary data are already lying around here at Typfreq Suites, why not start with population size? Stay tuned.