My main complaint about the database is that all data are categorical -- even variables that are most naturally numerical, like segment inventory sizes. This means we lose information and, most likely, statistical power in our analyses. Fortunately, one should be able to retrieve the numerical values for most variables from the original source database (e.g., UPSID), but that means extra work.
I also find the format in which they distribute the data a little unwieldy. It consists of a main "datapoints" table in which the column names are just numerical codes, and likewise for the values in each cell. To figure out what anything means, you have to consult two auxiliary tables: one to see what the numerical code of the column refers to, and another to see what the value indicates. For example, if you see that language "aar" (Aari, spoken in Ethiopia) has value "6" for column "92", you have to look those up in the other tables: column 92 is "Position of Polar Question Particles" and value 6 means "No question particle". It seems to me that it would be much simpler to have one large flat table with fully human-readable column names and cell values. Such a table displays just fine in any decent spreadsheet, and the main advantage is that you can read that one table straight into R: it will parse the column names, produce factor variables, and you'll be set to go.
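To make the indirection concrete, here is a minimal sketch of the two-step lookup, using tiny hand-made stand-ins for the three tables (the dictionaries below are illustrative toy data, not the real WALS files):

```python
# Miniature stand-ins for the WALS auxiliary tables (illustrative only).
features = {'92': 'Position of Polar Question Particles'}
values = {('92', '6'): 'No question particle'}

# One row of the datapoints table: language "aar" has value "6" for column "92".
datapoint = {'wals_code': 'aar', '92': '6'}

# Decoding a single cell takes two separate lookups:
feature_name = features['92']
value_desc = values[('92', datapoint['92'])]
print(feature_name)  # Position of Polar Question Particles
print(value_desc)    # No question particle
```

Multiply that by every cell in the table and the appeal of a single pre-decoded file becomes obvious.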
So here's some Python code to convert the WALS data into a single flat table:
```python
# wals.py -- provides a function to collate the WALS data into a single flat table.
import csv
import os

def collateWALS(wals_dir, dest):
    # Read the three auxiliary tables into lookup dictionaries.
    with open(os.path.join(wals_dir, 'features.csv'), newline='') as f:
        featureDict = {row['id']: row['name'].replace(' ', '.')
                       for row in csv.DictReader(f)}
    with open(os.path.join(wals_dir, 'values.csv'), newline='') as f:
        valueDict = {(row['feature_id'], row['value_id']): row['description']
                     for row in csv.DictReader(f)}
    with open(os.path.join(wals_dir, 'languages.csv'), newline='') as f:
        languages = csv.DictReader(f)
        langDict = {row['wals code']: row for row in languages}
        langFields = languages.fieldnames

    with open(os.path.join(wals_dir, 'datapoints.csv'), newline='') as df, \
         open(dest, 'w', newline='') as out:
        datapoints = csv.DictReader(df)
        # Replace the numerical feature codes with human-readable column names.
        fields = langFields + [featureDict[fid] for fid in datapoints.fieldnames
                               if fid in featureDict]
        w = csv.DictWriter(out, fields)
        w.writeheader()
        for drow in datapoints:
            # Start from the language metadata, then decode each datapoint cell.
            row = dict(langDict[drow['wals_code']])
            for f_id, v_id in drow.items():
                if f_id == 'wals_code':
                    continue
                row[featureDict[f_id]] = valueDict[(f_id, v_id)] if v_id else 'NA'
            w.writerow(row)
```

You use it like so:
```python
import wals
wals.collateWALS('path_to_your_wals_data_folder', 'wals.csv')
```

This will create a file wals.csv, which contains the whole database as one table. Note that spaces are replaced by '.' in the column names.
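Once collated, the flat table is self-describing and needs no further lookup. A quick sketch reading a sample in the same format (the two rows below are a hypothetical stand-in for the generated wals.csv, not real output):

```python
import csv
import io

# A hypothetical two-line sample in the format that wals.csv would have.
sample = ("wals code,name,Position.of.Polar.Question.Particles\n"
          "aar,Aari,No question particle\n")

for row in csv.DictReader(io.StringIO(sample)):
    # Column names and cell values are directly human-readable.
    print(row['name'], '->', row['Position.of.Polar.Question.Particles'])
```

The same file reads into R with a single read.csv call, with no code tables required.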