Sunday, March 6, 2011

My solution for Vim + Sweave + Python + Latexmk

I of course write all my documents in LaTeX, and lately I've been into using two excellent tools for writing papers on quantitative topics: Sweave (/ɛs wiv/) and the python.sty package. The former allows you to include R code in the TeX source of your document; the code is executed as the document is compiled, and its output (including figures) appears in the document. The latter does the same thing for Python rather than R. With the two of them, one can include all of one's data processing and statistical code in the very paper that reports the results.

Thursday, November 4, 2010

Open Letter to a Job Candidate

As I continue to draft and re-draft my application materials, in between teaching, I haven't had much time for blogging. Here is a useful "open letter" by Robert Port that I found on the UCLA Linguistics site: Open Letter to a Job Candidate

Wednesday, September 15, 2010

Matt Might on Dissertation Proposals

I haven't posted in a while, mostly because I'm engrossed in writing my dissertation proposal. I thought I'd take a time-out, though, to link to a very useful blog post by Matt Might on the nature of dissertation proposals. John Goldsmith was heard to remark that "all grad students should be pointed to this".

See also the recent post by Matt on 10 easy ways to fail a Ph.D.

(Hat tips to Jason Riggle and Alan Yu.)

Monday, August 9, 2010

World Atlas of Language Structures: Collated Data

The World Atlas of Language Structures (or WALS as everyone calls it) is a wonderful database, and accompanying book, put together by a bunch of very good linguists, and hosted by the Max Planck Digital Library. It documents a whole slew of features for over 2,500 languages. Because it combines many earlier databases that covered different sets of languages, not every feature is known for every language. Happily the data are now freely downloadable; I remember a few years ago having to extract the data from the CD that accompanied the printed version.

My main complaint about the database is that all data are categorical -- even variables that are most naturally numerical, like segment inventory sizes. This means we lose information and, most likely, power in our statistical analyses. Fortunately one should be able to retrieve the numerical values for most variables by going to the original source database (e.g., UPSID), but it means extra work.

I also find the format in which they distribute the data to be a little bit unwieldy. It consists of a main "datapoints" table, in which the column names are just numerical codes, and likewise for the values in each cell. To figure out what anything means, you have to look at two auxiliary tables: one to see what the numerical code of the column refers to, and another to see what the value indicates. For example, if you see that language "aar" (Aari, spoken in Ethiopia) has value "6" for column "92", you have to look those up in the other tables: column 92 is "Position of Polar Question Particles" and value 6 means "No question particle". It seems to me that it would be a lot simpler to just have one large flat table with fully human-readable column names and cell values. This will display just fine in any decent spreadsheet. The main advantage, though, is that you could read that one table straight into R: it will parse the column names, produce factor variables, and you'll be set to go.

So here's some Python code to convert the WALS data into a single flat table:
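
What follows is a minimal sketch of that conversion rather than the original script. It assumes three CSV files whose layout I am guessing at: a datapoints table with a `wals_code` column plus one column per feature code, a features table with `id` and `name` columns, and a values table with `feature_id`, `value_id`, and `description` columns. All of those column names, and the function name `flatten_wals`, are hypothetical; adjust them to whatever the actual download uses.

```python
import csv
import io


def flatten_wals(datapoints, features, values):
    """Return (header, rows) for one flat, human-readable table.

    datapoints: iterable of dicts, one per language, keyed by column name
    features:   iterable of dicts with keys "id" and "name" (assumed layout)
    values:     iterable of dicts with keys "feature_id", "value_id",
                "description" (assumed layout)
    """
    # Feature code -> readable feature name, e.g. "92" -> "Position of ...".
    feature_names = {f["id"]: f["name"] for f in features}
    # (feature code, value code) -> readable value label.
    value_labels = {
        (v["feature_id"], v["value_id"]): v["description"] for v in values
    }

    header = None
    flat_rows = []
    for row in datapoints:
        if header is None:
            # Columns that are not feature codes (e.g. the language code)
            # keep their original names.
            header = [feature_names.get(col, col) for col in row]
        # Cells with no matching label (language codes, empty cells)
        # are passed through unchanged.
        flat_rows.append(
            [value_labels.get((col, cell), cell) for col, cell in row.items()]
        )
    return header or [], flat_rows


# Tiny in-memory demo built from the Aari example in the text.
demo_datapoints = io.StringIO("wals_code,92\naar,6\n")
demo_features = io.StringIO("id,name\n92,Position of Polar Question Particles\n")
demo_values = io.StringIO(
    "feature_id,value_id,description\n92,6,No question particle\n"
)

header, rows = flatten_wals(
    csv.DictReader(demo_datapoints),
    csv.DictReader(demo_features),
    csv.DictReader(demo_values),
)

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(header)
writer.writerows(rows)
print(out.getvalue())
```

To run it on real files, open each one and pass `csv.DictReader(f)` in place of the in-memory demo tables, then write the result to a file instead of printing it. The flat table can then be loaded into R with `read.csv`, with the usual caveat that R will rewrite column names containing spaces unless told not to.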

Saturday, August 7, 2010

The Encyclopedia of Linguistic Laws

If you're looking for a good time, check out The Encyclopedia of Linguistic Laws, hosted by Universität Trier, and apparently maintained by Gabriel Altmann (!), Reinhard Köhler, and Relja Vulanović (for whom I can't seem to find a homepage). It has a fair number of articles, each covering some linguistic phenomenon (e.g., phoneme frequency, morphological productivity) or some quantitative law or mode of analysis (e.g., lexeme nets). Each article is structured nicely and summarizes recent results from the field (or maybe just the author's results?). Most of the articles list Altmann as an author, so I assume the encyclopedia is his brainchild. Seems like a great place for quickly reviewing results that have been published in venues like Glottometrika and JQL over the years.

Geographic dispersion

[Figure: Some languages from AUTOTYP, colored by areal grouping.]

The AUTOTYP project offers a freely downloadable CSV file containing genealogical and geographical data on ~2,700 languages. The geographical data include a division of languages into areal groups according to a system of continental and sub-continental linguistic regions, as well as geographic coordinates (latitude and longitude) that roughly approximate "what is generally considered to be the sociolinguistic center of the language," according to the manual.
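
One simple way to put a number on geographic dispersion with data like these is the mean pairwise great-circle distance between the languages of each areal group. The sketch below is only an illustration of that idea and may not be the measure intended here. The column names `area`, `latitude`, and `longitude` are assumptions, not the actual AUTOTYP headers, and the demo rows are made-up points rather than real languages.

```python
import math
from collections import defaultdict
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to guard against rounding pushing the value just above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def mean_pairwise_distance(rows, area_key="area",
                           lat_key="latitude", lon_key="longitude"):
    """Map each area to the mean pairwise distance (km) among its languages.

    Areas with fewer than two languages are left out, since they have
    no pairs to average over.
    """
    points = defaultdict(list)
    for row in rows:
        points[row[area_key]].append(
            (float(row[lat_key]), float(row[lon_key]))
        )

    result = {}
    for area, coords in points.items():
        dists = [haversine_km(*p, *q) for p, q in combinations(coords, 2)]
        if dists:
            result[area] = sum(dists) / len(dists)
    return result


# Made-up demo rows: area "A" has two points a quarter of the globe apart,
# area "B" has a single point and is therefore skipped.
demo_rows = [
    {"area": "A", "latitude": "0", "longitude": "0"},
    {"area": "A", "latitude": "0", "longitude": "90"},
    {"area": "B", "latitude": "10", "longitude": "10"},
]
dispersion = mean_pairwise_distance(demo_rows)
print(dispersion)
```

With the real file, `csv.DictReader` yields rows in the right shape once the three key arguments are set to the actual column names. Pairwise distances grow quadratically with group size, which is fine for a few thousand languages but worth keeping in mind.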

Friday, August 6, 2010

The numbers 1 through 10 in 5,020 languages

Language enthusiast Mark Rosenfelder keeps a pretty darn extensive list of number words in over 5,000 languages. Check it out here. The list includes many extinct languages, artificial languages, reconstructed protolanguages, and dialectal varieties; the number of living, natural languages is 3,952. He keeps an accompanying list of sources here, though not all are scholarly. The main problem with the list is that it doesn't use a consistent orthography (much less the IPA), and it's not always clear what kind of orthography is being used for a given language. It also lacks modern ISO 639-3 identifiers for the languages, so it will be messy to try to cross-reference them with other databases. Still, it's an awful lot of data and I'm sure something interesting could be done with it.