Monday, August 9, 2010

World Atlas of Language Structures: Collated Data

The World Atlas of Language Structures (or WALS, as everyone calls it) is a wonderful database, and accompanying book, put together by a bunch of very good linguists and hosted by the Max Planck Digital Library. It documents a whole slew of features for over 2,500 languages, though, because it combines many earlier databases that covered different sets of languages, not every feature is coded for every language. Happily, the data are now freely downloadable; a few years ago I remember having to extract them from the CD that accompanied the printed version.

My main complaint about the database is that all data are categorical -- even variables that are most naturally numerical, like segment inventory sizes. This means we lose information and, most likely, power in our statistical analyses. Fortunately one should be able to retrieve the numerical values for most variables by going to the original source database (e.g., UPSID), but it means extra work.

I also find the format in which they distribute the data a little unwieldy. It consists of a main "datapoints" table, in which the column names are just numerical codes, and likewise for the values in each cell. To figure out what anything means, you have to look at two auxiliary tables: one to see what the numerical code of the column refers to, and another to see what the value indicates. For example, if you see that language "aar" (Aari, spoken in Ethiopia) has value "6" for column "92", you have to look those up in the other tables: column 92 is "Position of Polar Question Particles" and value 6 means "No question particle". It seems to me that it would be a lot simpler to just have one large flat table with fully human-readable column names and cell values. Such a table displays just fine in any decent spreadsheet, and the main advantage is that you can read it straight into R: the column names parse cleanly, the cell values become factor variables, and you're set to go.
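
Just to make that lookup concrete, here is a minimal sketch of it done by hand (a hypothetical decode helper, assuming the same file names and column headers that the collation script below uses: features.csv with 'id' and 'name' columns, values.csv with 'feature_id', 'value_id', and 'description'):

# decode_one.py -- look up a single WALS datapoint by hand (illustrative sketch)
import csv, os

def decode(wals_dir, feature_id, value_id):
    # features.csv maps a feature's numerical code to its name
    with open(os.path.join(wals_dir, 'features.csv')) as f:
        feature_names = dict((row['id'], row['name']) for row in csv.DictReader(f))
    # values.csv maps a (feature id, value id) pair to a description
    with open(os.path.join(wals_dir, 'values.csv')) as f:
        value_descs = dict(((row['feature_id'], row['value_id']), row['description'])
                           for row in csv.DictReader(f))
    return feature_names[feature_id], value_descs[(feature_id, value_id)]

# decode('path_to_your_wals_data_folder', '92', '6')
# -> ('Position of Polar Question Particles', 'No question particle')

That's two extra file reads just to decode one cell; the collation script below does this bookkeeping once for the whole database.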

So here's some Python code to convert the WALS data into a single flat table:


# wals.py -- provides a function to collate the WALS data into a single flat table.
import csv, os

def collateWALS(wals_dir, dest):
    """Merge the WALS tables into one flat, human-readable CSV.

    wals_dir: directory holding the WALS export (features.csv, languages.csv,
              values.csv, datapoints.csv)
    dest:     path of the flat CSV file to write
    """
    features = csv.DictReader(open(os.path.join(wals_dir, 'features.csv'), 'r'))
    languages = csv.DictReader(open(os.path.join(wals_dir, 'languages.csv'), 'r'))
    values = csv.DictReader(open(os.path.join(wals_dir, 'values.csv'), 'r'))

    # Build the lookup tables: language metadata keyed by WALS code, value
    # descriptions keyed by (feature id, value id), and feature names keyed
    # by feature id, with spaces replaced by '.' to make tidy column names.
    langDict = dict([(row['wals code'], row) for row in languages])
    valueDict = dict([((row['feature_id'], row['value_id']),
        row['description']) for row in values])
    featureDict = dict([(row['id'], row['name'].replace(' ', '.')) for row in features])

    # Peek at the datapoints header to assemble the output columns, then
    # rewind the file and re-create the reader so we can iterate over all rows.
    df = open(os.path.join(wals_dir, 'datapoints.csv'), 'r')
    datapoints = csv.DictReader(df)
    datapoints.next()
    fields = languages.fieldnames + [featureDict[fid] for fid in
            datapoints.fieldnames if fid in featureDict]
    df.seek(0)
    datapoints = csv.DictReader(df)

    w = csv.DictWriter(open(dest, 'w'), fields)
    w.writerow(dict(zip(fields, fields))) # header row (no writeheader() before Python 2.7)
    for drow in datapoints:
        row = {}
        wc = drow['wals_code']
        # Start the output row with this language's metadata from languages.csv.
        for k, v in langDict[wc].iteritems():
            row[k] = v

        # Decode each (feature id, value id) pair into readable text;
        # missing datapoints become 'NA' so R will treat them as missing values.
        for f_id, v_id in drow.iteritems():
            if f_id == 'wals_code':
                continue
            f_name = featureDict[f_id]
            if v_id:
                value = valueDict[(f_id, v_id)]
            else:
                value = 'NA'
            row[f_name] = value

        w.writerow(row)
You use it like so:
import wals
wals.collateWALS('path_to_your_wals_data_folder', 'wals.csv')
This will create a file wals.csv, which contains the whole database as one table. Note that spaces are replaced by '.' in the column names.
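
As a quick sanity check (a sketch, not part of the script; it assumes the column headers come out exactly as the script writes them, e.g. 'wals code' from languages.csv and the dotted feature names), you can reload the flat file and find the Aari datapoint from the earlier example:

import csv

rows = list(csv.DictReader(open('wals.csv', 'r')))
print(len(rows))  # one row per language in datapoints.csv

aari = [r for r in rows if r['wals code'] == 'aar'][0]
print(aari['Position.of.Polar.Question.Particles'])  # 'No question particle'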
