Thursday, November 4, 2010
Open Letter to a Job Candidate
As I continue to draft and re-draft my application materials, in between teaching, I haven't had much time for blogging. Here is a useful "open letter" by Robert Port that I found on the UCLA Linguistics site: Open Letter to a Job Candidate
Wednesday, September 15, 2010
Matt Might on Dissertation Proposals
I haven't posted in a while, mostly because I'm engrossed in writing my dissertation proposal. I thought I'd take a time-out, though, to link to a very useful blog post by Matt Might on the nature of dissertation proposals. John Goldsmith was heard to remark that "all grad students should be pointed to this".
See also the recent post by Matt on 10 easy ways to fail a Ph.D.
(Hat tips to Jason Riggle and Alan Yu.)
Monday, August 9, 2010
World Atlas of Language Structures: Collated Data
The World Atlas of Language Structures (or WALS as everyone calls it) is a wonderful database, and accompanying book, put together by a bunch of very good linguists, and hosted by the Max Planck Digital Library. It documents a whole slew of features for over 2,500 languages, though not every feature is recorded for every language, since it combines many earlier databases that covered different sets of languages. Happily the data are now freely downloadable; I remember a few years ago having to extract the data from the CD that accompanied the printed version.
My main complaint about the database is that all data are categorical -- even variables that are most naturally numerical, like segment inventory sizes. This means we lose information and, most likely, power in our statistical analyses. Fortunately one should be able to retrieve the numerical values for most variables by going to the original source database (e.g., UPSID), but it means extra work.
I also find the format in which they distribute the data a little unwieldy. It consists of a main "datapoints" table, in which the column names are just numerical codes, and likewise for the values in each cell. To figure out what anything means, you have to look at two auxiliary tables: one to see what the numerical code of the column refers to, and another to see what the value indicates. For example, if you see that language "aar" (Aari, spoken in Ethiopia) has value "6" for column "92", you have to look those up in the other tables: column 92 is "Position of Polar Question Particles" and value 6 means "No question particle". It seems to me that it would be a lot simpler to just have one large flat table with fully human-readable column names and cell values. This will display just fine in any decent spreadsheet, and the main advantage is that you can read that one table straight into R, which will parse the column names and produce factor variables, and you'll be set to go.
So here's some Python code to convert the WALS data into a single flat table:
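The sketch below shows one way such a conversion might look; it is an illustration, not the original script. The table layouts and column names (wals_code, id, name, feature_id, value_id, description) are assumptions made for the example and may not match the actual WALS download, and the language "xxx" is made up to show an empty cell.

```python
import csv
import io


def flatten_wals(datapoints, features, values):
    """Return (header, rows) with readable feature names and value labels.

    Assumed (hypothetical) layouts:
      datapoints: dicts keyed by 'wals_code' plus one key per feature id
      features:   dicts with keys 'id' and 'name'
      values:     dicts with keys 'feature_id', 'value_id', 'description'
    """
    feature_name = {f["id"]: f["name"] for f in features}
    value_label = {
        (v["feature_id"], v["value_id"]): v["description"] for v in values
    }

    feature_ids = [key for key in datapoints[0] if key != "wals_code"]
    header = ["wals_code"] + [feature_name.get(fid, fid) for fid in feature_ids]

    rows = []
    for lang in datapoints:
        row = [lang["wals_code"]]
        for fid in feature_ids:
            code = lang[fid]
            # Unknown codes are kept as-is; empty cells stay empty.
            row.append(value_label.get((fid, code), code))
        rows.append(row)
    return header, rows


def read_table(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


# Tiny in-memory stand-ins for the three tables.
datapoints = read_table("wals_code,92\naar,6\nxxx,\n")
features = read_table("id,name\n92,Position of Polar Question Particles\n")
values = read_table("feature_id,value_id,description\n92,6,No question particle\n")

header, rows = flatten_wals(datapoints, features, values)

# Write the flat table as CSV text (use a real file in practice).
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
flat_csv = out.getvalue()
print(flat_csv)
```

The resulting file can then be loaded with a single read.csv call in R.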
Saturday, August 7, 2010
The Encyclopedia of Linguistic Laws
If you're looking for a good time, check out The Encyclopedia of Linguistic Laws, hosted by Universität Trier, and apparently maintained by Gabriel Altmann (!), Reinhard Köhler, and Relja Vulanović (for whom I can't seem to find a homepage). It has a fair number of articles, each covering some linguistic phenomenon (e.g., phoneme frequency, morphological productivity) or some quantitative law or mode of analysis (e.g., lexeme nets). Each article is structured nicely and summarizes recent results from the field (or maybe just the author's results?). Most of the articles list Altmann as an author, so I assume the encyclopedia is his brainchild. Seems like a great place for quickly reviewing results that have been published in venues like Glottometrika and JQL over the years.
Geographic dispersion
Some languages from Autotyp, colored by areal grouping.
Friday, August 6, 2010
The numbers 1 through 10 in 5,020 languages
Language enthusiast Mark Rosenfelder keeps a pretty darn extensive list of number words in over 5,000 languages. Check it out here. The list includes many extinct languages, artificial languages, reconstructed protolanguages, and dialectal varieties; the number of living, natural languages is 3,952. He keeps an accompanying list of sources here, though not all are scholarly. The main problem with the list is that it doesn't use a consistent orthography (much less the IPA), and it's not always clear what kind of orthography is being used for a given language. It also lacks modern ISO 639-3 identifiers for the languages, so it will be messy to try to cross-reference them with other databases. Still, it's an awful lot of data and I'm sure something interesting could be done with it.
Thursday, August 5, 2010
The Typological Database System (TDS)
Some of the categories of variables catalogued by TDS.
Speaker populations update: 6,305 languages
As a follow-up to my earlier posting... I've scraped the population figures for every language that has them available in Ethnologue. That turns out to be 6,305 languages, and the result is here. If you sum up all the populations, it comes to a total of 6,211,860,710 speakers (keeping in mind that many people speak more than one language, and of course that the figures are all very rough).
I also parsed out the years that Ethnologue attaches to the population numbers, and those are included. They update a lot of their population numbers whenever they release a new edition of Ethnologue, so most of them come from the 1990s and 2000s, but one could conceivably still learn something by comparing decades. E.g., do populations from the 1990s tend to be larger than populations from the 2000s?
Also, it seems that earlier editions of Ethnologue remain archived on their site, so one could possibly track population change within individual languages that way. Unfortunately, once you go about two editions back, things start to get messy because they hadn't yet settled on the ISO 639-3 standard for language identifiers.
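As a rough illustration of the decade comparison, here is a small sketch that groups population figures by the decade of their year and takes the median of each group. The record layout (code, population, year) and the sample records are made up for the example; they are not taken from the scraped file.

```python
from collections import defaultdict
from statistics import median


def median_by_decade(records):
    """Map each decade (e.g. 1990) to the median population reported in it.

    records: iterable of (code, population, year) tuples; year may be None
    when no year is attached to the figure, in which case the record is skipped.
    """
    by_decade = defaultdict(list)
    for _code, population, year in records:
        if year is None:
            continue
        by_decade[year // 10 * 10].append(population)
    return {decade: median(pops) for decade, pops in sorted(by_decade.items())}


# Hypothetical records, for illustration only.
records = [
    ("lang1", 1200, 1993),
    ("lang2", 50000, 1998),
    ("lang3", 300, 2004),
    ("lang4", 8000, 2006),
    ("lang5", 700, None),
]

decade_medians = median_by_decade(records)
total_speakers = sum(population for _code, population, _year in records)
print(decade_medians, total_speakers)
```

The median is used rather than the mean because a handful of very large languages would otherwise dominate each decade.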
Population sizes and phonological inventory sizes
Just came across a paper by Vladimir Pericliev, of the Department of Mathematical Linguistics (sweet!) at the Bulgarian Academy of Sciences. Title: "There is no correlation between the size of a community speaking a language and the size of the phonological inventory of that language." PDF here. In it he offers evidence against claims by Trudgill that large speaker communities favor languages with medium-sized phonological segment inventories, while small communities favor either small or large segment inventories. Pericliev takes population figures from Ethnologue and cross-indexes them with the languages in UPSID (the UCLA Phonological Segment Inventory Database); he doesn't see evidence of the claimed correlation.
So anyway, I'm not the first to scrape population figures from Ethnologue :)
Hat tip to Morgan.
Speaker populations
Ethnologue is a great resource, being a catalogue of almost 7,000 of the world's known languages. An entry contains such information about a language as: the regions where it is spoken, its genetic classification, known dialects, and the approximate number of people who speak it. Lately I've been thinking that this last quantity might be an interesting covariate to consider when looking at typological distributions, so I hacked together a Python script to scrape it off the Ethnologue site. Given the three-letter ISO 639-3 identifier of a language, it attempts to retrieve the Ethnologue entry for that language and extract the population figure.
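A minimal sketch of the idea follows; it is not the script itself. The URL pattern in ENTRY_URL and the assumption that the page text contains "Population" followed by a figure and an optional year in parentheses are both guesses, and the sample HTML is made up, so check them against the real pages before relying on this.

```python
import re
from urllib.request import urlopen

# Assumed URL pattern (hypothetical); verify against the live site.
ENTRY_URL = "http://www.ethnologue.com/show_language.asp?code={code}"

# "Population", then a figure such as 155,000, then optionally "(2007".
POPULATION_RE = re.compile(r"Population\s*:?\s*(\d[\d,]*)(?:\s*\((\d{4}))?")


def extract_population(html):
    """Return (population, year) from an entry page, or None if not found.

    year is None when no four-digit year follows the figure.
    """
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping
    match = POPULATION_RE.search(text)
    if match is None:
        return None
    population = int(match.group(1).replace(",", ""))
    year = int(match.group(2)) if match.group(2) else None
    return population, year


def fetch_population(code):
    """Download the entry for a three-letter ISO 639-3 code and parse it."""
    if not re.fullmatch(r"[a-z]{3}", code):
        raise ValueError("expected a three-letter ISO 639-3 code")
    with urlopen(ENTRY_URL.format(code=code)) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_population(html)


# Made-up page fragment, for illustration only (no network access needed).
sample = "<tr><td>Population</td><td>155,000 (2007).</td></tr>"
print(extract_population(sample))
```

Keeping the parsing separate from the download makes it easy to test the regular expression on saved pages before hitting the site for thousands of languages.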