Typfreq: 2010

Thursday, November 4, 2010

Open Letter to a Job Candidate

As I continue to draft and re-draft my application materials, in between teaching, I haven't had much time for blogging. Here is a useful "open letter" by Robert Port that I found on the UCLA Linguistics site: Open Letter to a Job Candidate

Wednesday, September 15, 2010

Matt Might on Dissertation Proposals

I haven't posted in a while, mostly because I'm engrossed in writing my dissertation proposal. I thought I'd take a time-out, though, to link to a very useful blog post by Matt Might on the nature of dissertation proposals. John Goldsmith was heard to remark that "all grad students should be pointed to this".

See also the recent post by Matt on 10 easy ways to fail a Ph.D.

(Hat tips to Jason Riggle and Alan Yu.)

Monday, August 9, 2010

World Atlas of Language Structures: Collated Data

The World Atlas of Language Structures (or WALS as everyone calls it) is a wonderful database, and accompanying book, put together by a bunch of very good linguists, and hosted by the Max Planck Digital Library. It documents a whole slew of features for over 2,500 languages, though not every feature is known for every language, since the database combines many earlier databases that covered different sets of languages. Happily the data are now freely downloadable; I remember a few years ago having to extract the data from the CD that accompanied the printed version.

My main complaint about the database is that all data are categorical -- even variables that are most naturally numerical, like segment inventory sizes. This means we lose information and, most likely, power in our statistical analyses. Fortunately one should be able to retrieve the numerical values for most variables by going to the original source database (e.g., UPSID), but it means extra work.

I also find the format in which they distribute the data to be a little bit unwieldy. It consists of a main "datapoints" table, in which the column names are just numerical codes, and likewise for the values in each cell. To figure out what anything means, you have to look at two auxiliary tables: one to see what the numerical code of the column refers to, and another to see what the value indicates. For example, if you see that language "aar" (Aari, spoken in Ethiopia) has value "6" for column "92", you have to look those up in the other tables: column 92 is "Position of Polar Question Particles" and value 6 means "No question particle". It seems to me that it would be a lot simpler to just have one large flat table with fully human-readable column names and cell values. This will display just fine in any decent spreadsheet. The main advantage is that you can read that one table straight into R, which will parse the column names and produce factor variables, and you'll be set to go.

So here's some Python code to convert the WALS data into a single flat table:
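The original script isn't reproduced here, so what follows is only a minimal sketch of the idea, not the code I actually ran. It assumes the datapoints table has a language identifier column called `wals_code` (an assumed name), and that the two auxiliary tables have already been loaded into dictionaries; `flatten`, `feature_names` and `value_labels` are names made up for illustration. The toy input is just the Aari example from above.

```python
import csv
import io


def flatten(datapoint_rows, feature_names, value_labels, id_column="wals_code"):
    """Replace numeric feature codes and value codes with readable labels.

    datapoint_rows: iterable of dicts, e.g. from csv.DictReader
    feature_names:  {feature_code: feature_name}
    value_labels:   {(feature_code, value_code): value_description}
    id_column:      assumed name of the language identifier column
    """
    flat_rows = []
    for row in datapoint_rows:
        flat = {id_column: row[id_column]}
        for column, cell in row.items():
            if column == id_column:
                continue
            # Fall back to the raw code if a lookup is missing.
            name = feature_names.get(column, column)
            # An empty cell means the feature is not recorded for this language.
            flat[name] = value_labels.get((column, cell), cell) if cell else ""
        flat_rows.append(flat)
    return flat_rows


# Toy input built from the Aari example in the text.
datapoints_csv = "wals_code,92\naar,6\n"
feature_names = {"92": "Position of Polar Question Particles"}
value_labels = {("92", "6"): "No question particle"}

rows = flatten(csv.DictReader(io.StringIO(datapoints_csv)),
               feature_names, value_labels)

# Write the flat table as CSV (to a string here; a file works the same way).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Keying the value lookup on the (feature, value) pair matters because the same value code, such as 6, means something different under each feature.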

The Encyclopedia of Linguistic Laws

If you're looking for a good time, check out The Encyclopedia of Linguistic Laws, hosted by Universität Trier, and apparently maintained by Gabriel Altmann (!), Reinhard Köhler, and Relja Vulanović (for whom I can't seem to find a homepage). It has a fair number of articles, each covering some linguistic phenomenon (e.g., phoneme frequency, morphological productivity) or some quantitative law or mode of analysis (e.g., lexeme nets). Each article is structured nicely and summarizes recent results from the field (or maybe just the author's results?). Most of the articles list Altmann as an author, so I assume the encyclopedia is his brainchild. Seems like a great place for quickly reviewing results that have been published in venues like Glottometrika and JQL over the years.

Geographic dispersion

Some languages from Autotyp, colored by areal grouping.

The AUTOTYP project offers a freely downloadable CSV file containing genealogical and geographical data on ~2,700 languages. The geographical data include a division of languages into areal groups according to a system of continental and sub-continental linguistic regions, as well as geographic coordinates (latitude and longitude) that roughly approximate "what is generally considered to be the sociolinguistic center of the language," according to the manual.

The numbers 1 through 10 in 5,020 languages

Language enthusiast Mark Rosenfelder keeps a pretty darn extensive list of number words in over 5,000 languages. Check it out here. The list includes many extinct languages, artificial languages, reconstructed protolanguages, and dialectal varieties; the number of living, natural languages is 3,952. He keeps an accompanying list of sources here, though not all are scholarly. The main problem with the list is that it doesn't use a consistent orthography (much less the IPA), and it's not always clear what kind of orthography is being used for a given language. It also lacks modern ISO 639-3 identifiers for the languages, so it will be messy to try to cross-reference them with other databases. Still, it's an awful lot of data and I'm sure something interesting could be done with it.

Thursday, August 5, 2010

The Typological Database System (TDS)

Some of the categories of variables catalogued by TDS.

The Typological Database System (TDS) is a really sweet resource for typological data, and it would be silly not to mention it early on in this blog. It's basically a meta-database, combining and cross-indexing information from 14 existing typological databases, and covering a total of 1,166 languages (though not all information is available for all languages). Its sources include the Anaphora Typology Database, the UCLA Phonological Segment Inventory Database, the World Color Survey, the Person Agreement Database, the Graz Database on Reduplication, and many others, giving a good mix of phonological, morphosyntactic, and semantic phenomena. On the right you can see a screenshot showing some of the topics covered. The web-based interface is particularly nice, and friendly for getting your hands on the raw data. My only beef is that it doesn't work with Chrome.

Speaker populations update: 6,305 languages

As a follow-up to my earlier posting... I've scraped the population figures for every language that has them available in Ethnologue. That turns out to be 6,305 languages, and the result is here. If you sum up all the populations, it comes to a total of 6,211,860,710 speakers (keeping in mind that many people speak more than one language, and of course that the figures are all very rough).

I also parsed out the years that Ethnologue attaches to the population numbers, and those are included. They update a lot of their population numbers whenever they release a new edition of Ethnologue, so most of them come from the 1990s and 2000s, but one could conceivably still learn something by comparing decades. E.g., do populations from the 1990s tend to be larger than populations from the 2000s?
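One rough way to make that decade comparison is to bucket the figures by the decade of their attached year and look at a robust summary such as the median. This is only a sketch: `median_population_by_decade` is a hypothetical helper, and it assumes the scraped data is available as (population, year) pairs, with `None` where no year was attached.

```python
from collections import defaultdict
from statistics import median


def median_population_by_decade(records):
    """Group (population, year) pairs by decade and take the median of each.

    Records whose year is None are skipped, since they cannot be
    assigned to a decade.
    """
    by_decade = defaultdict(list)
    for population, year in records:
        if year is None:
            continue
        by_decade[(year // 10) * 10].append(population)
    return {decade: median(values)
            for decade, values in sorted(by_decade.items())}
```

The median is used rather than the mean because speaker populations are extremely skewed, so a handful of very large languages would otherwise dominate each decade's figure.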

Also, it seems that earlier editions of Ethnologue remain archived on their site, so one could possibly track population change within individual languages that way. Unfortunately, once you go about two editions back, things start to get messy because they hadn't yet settled on the ISO 639-3 standard for language identifiers.

Population sizes and phonological inventory sizes

Just came across a paper by Vladimir Pericliev, of the Department of Mathematical Linguistics (sweet!) at the Bulgarian Academy of Sciences. Title: "There is no correlation between the size of a community speaking a language and the size of the phonological inventory of that language." PDF here. In it he offers evidence against claims by Trudgill that large speaker communities favor languages with medium-sized phonological segment inventories, while small communities favor either small or large segment inventories. Pericliev takes population figures from Ethnologue and cross-indexes them with the languages in UPSID (the UCLA Phonological Segment Inventory Database); he doesn't see evidence of the claimed correlation.
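For anyone wanting to poke at this kind of question themselves, here is a minimal sketch of cross-indexing two tables by language identifier and computing a Spearman rank correlation. To be clear, this is not necessarily the method used in the paper, and a rank correlation only detects monotonic trends, so it would not by itself capture the non-monotonic pattern Trudgill describes. The function names and the dictionary inputs are assumptions for illustration.

```python
def average_ranks(values):
    """Rank values starting from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman_by_language(populations, inventory_sizes):
    """Spearman correlation over the languages present in both dictionaries.

    Both arguments map a language identifier to a number.
    """
    shared = sorted(set(populations) & set(inventory_sizes))
    xs = average_ranks([populations[code] for code in shared])
    ys = average_ranks([inventory_sizes[code] for code in shared])
    return pearson(xs, ys)
```

Working on ranks sidesteps the huge skew in population figures, which would otherwise let a few giant languages drive an ordinary correlation.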

So anyway, I'm not the first to scrape population figures from Ethnologue :)

Hat tip to Morgan.

Speaker populations

Ethnologue is a great resource, being a catalogue of almost 7,000 of the world's known languages. An entry contains such information about a language as: the regions where it is spoken, its genetic classification, known dialects, and the approximate number of people who speak it. Lately I've been thinking that this last quantity might be an interesting covariate to consider when looking at typological distributions, so I hacked together a Python script to scrape it off of the Ethnologue site. Given the three-letter ISO 639-3 identifier of a language, it attempts to retrieve the Ethnologue entry for that language and extract the population figure.
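The script itself isn't shown here, so the following is just a minimal sketch of the approach rather than the original code. Two things in it are assumptions: the URL template is a placeholder (the real Ethnologue address scheme is not given here), and the parser assumes the entry text contains the word "Population" followed by a figure and, optionally, a parenthesised year.

```python
import re
import urllib.request

# Hypothetical placeholder; substitute the real entry URL pattern.
ENTRY_URL = "https://example.org/language/{code}"

# Assumed entry format: "Population 155,000 (2007 census)".
POPULATION_RE = re.compile(r"Population\s+(\d[\d,]*)(?:\s*\((\d{4})[^)]*\))?")


def parse_population(entry_text):
    """Return (population, year) from entry text, or None if no figure is found.

    year is None when no four-digit year directly follows the figure.
    """
    match = POPULATION_RE.search(entry_text)
    if match is None:
        return None
    population = int(match.group(1).replace(",", ""))
    year = int(match.group(2)) if match.group(2) else None
    return population, year


def fetch_population(iso_code, url_template=ENTRY_URL):
    """Download the entry for an ISO 639-3 code and parse its population."""
    url = url_template.format(code=iso_code)
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8", errors="replace")
    return parse_population(text)
```

Keeping the parsing separate from the download means the fragile part (the regular expression) can be checked against saved entry text without hitting the site each time.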
