Task | Due date |
---|---|
Coding HW | complete by Week 4 |
Analysis HW | due Week 7 |
The purpose of these assignments is to give you hands-on experience with extracting, coding, and analyzing a linguistic variable from natural speech data, using two specialized freeware packages: ELAN, for transcribing and coding the data, and Rbrul, a package that runs in R, for statistical analysis.
This is what (variationist) sociolinguists actually do, so you’ll get a chance to see how each step of the research process works. For this project, we will all work on the same sociolinguistic variable: the pronunciation of (r) in Boston English. If you are already comfortable using Goldvarb, and have other data that you will be working with for this course, check with me about substituting that data instead.
Step 1: Download the following paper (from http://repository.upenn.edu/pwpl/vol13/iss2/11/).
Irwin, P. & N. Nagy. 2007. Bostonians /r/ speaking: A quantitative look at (r) in Boston. Penn Working Papers in Linguistics 13(2) (Selected papers from NWAV 35). 135-47.
You may need to do a quick read of this article in order to understand the issues involved in the study of the variable (r). Later in the term, as part of the discussion of quantifying contact effects, we will discuss a follow-up article with more details:
Nagy, N. & P. Irwin. 2010. Boston (r): Neighbo(r)s nea(r) and fa(r). Language Variation and Change 22(2). 241-78.
A number of data files are required for this project. They are all located on the Corpora in the Classroom server.
Step 2: From Corpora in the Classroom, select the "New England Blizzard" corpus. Download two of the six audio .wav files and the matching .eaf files. (You will need to accept the Corpus Use Form first.) Save the files together in a folder on a computer that you will be able to find and use all semester.
You must agree to the conditions of use of our data files before you may use them. You will be asked to indicate your understanding of the conditions of use via an online Corpus Use form on the Corpora in the Classroom website. No assignments can be accepted for credit unless you have "e-signed" this form.
The .eaf files are the transcribed recordings that will serve as your raw data files. There is no need to print these files. Rather, read through these instructions first, which include tips on how to deal with these data files. This document will guide you step-by-step through the assignment and the analysis.
THE SAMPLE DESIGN: We will be working with data from six speakers, all from Boston. Ideally, each of you will pick different speakers/tokens to code. Here is some info about the speakers.
Speaker code | Sex | Age |
---|---|---|
F27A | female | 27 |
F57A | female | 57 |
F70A | female | 70 |
M18A | male | 18 |
M33A | male | 33 |
M65W | male | 65 |
Step 3: Download the software program ELAN and install it on your computer. ELAN is freeware that runs on Mac, Windows and Linux. On this Download page you will also see links to the User Guide and Manual for ELAN, which you can read online or download. (Instructions below are geared toward the Mac OS X version, but the steps are very similar on other operating systems. The online manual is written more for Windows users.)
Note: It will simplify things to install the application in the same folder as the .wav and .eaf files you will use -- this should be an option at least for Mac-users.
Note: Capitalization matters when you are coding tokens! "R" will not be seen as the same thing as "r" when you start running your analyses.
Open the newly created .txt file in Excel. (If you get the Import window, just click "Finish.") Somewhere in your .txt file (you may need to scroll to the bottom), there should be a few rows that look like this:
For this Part, you will be working with the two ELAN .eaf files that you created in Part 3.
Note: Capitalization matters when you are coding tokens! "e" will not be seen as the same thing as "E" when you start running your analyses.
Code each of the tokens for two independent variables. To do this, highlight a token, then click in the relevant tier ("preceding vowel" or "following context"), right below the token. In the new field that appears, type in the 1-letter code that describes the appropriate context. Be sure to code what you hear, not what the spelling suggests -- this may vary across speakers. If you need to make any notes about questionable tokens, etc., type them in the "Default" tier so you can find them later.
Categorize what you find in the data using the following coding scheme as a start. If you find a token that does not fit the existing categories, you can make up a new category and use it. Make sure to make a note of what your new abbreviation means and submit that with the assignment.
Code | IPA Symbol | Description | Example word |
---|---|---|---|
i | i or ɪ | high front | "beer" |
e | e or ɛ | mid front | "bear" |
a | ɑ, a or æ | low | "bar" (for some speakers, some tokens) |
o | o or ɔ | mid back | "bore" |
u | u or ʊ | high back round | "boor" (or "Bloor") |
2 | ə | unstressed mid central (schwa) | "runner" |
3 | ʌ | stressed mid central (caret) | "purchase" |
x | glide | glides or other sounds | "your" as [jr] or "p'ticipate" |
Code | Description | Example |
---|---|---|
v | Word-final, preceding a vowel | "car is" |
p | Word-final, preceding a pause | "car." |
c | Following consonant, in the next morpheme and in the next syllable | "wintertime" |
d | Following consonant, in the next morpheme but the same syllable | "winters" |
s | Morpheme-internal (Following consonant, in the same morpheme) | "card" |
You can find more examples for each variant in the reading.
Your file should now look something like this.
Your ELAN file should now look like this (except it won't have a tier for social factor coding).
Once you have coded both files, check your coding for accuracy and consistency.
The next step is to export your coded data, along with the transcription and timestamps, to a text file for statistical analysis.
To export a file:
To prepare your data to turn in for Week 4:
Tip: When you work with a more complex file of your own data, if you include the sentences related to each token, save as tab-delimited (.txt), rather than comma-delimited, to avoid confusing Rbrul by the commas inside any sentences. For this assignment, there are no commas in the tokens, so no problem.
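If you want to sanity-check the exported file before moving on, you can also open it in R directly. A minimal sketch, assuming your export is called "tokens.txt" (the file name and column names are placeholders -- substitute whatever your export produced):

```r
# Read the tab-delimited token file exported from ELAN.
# "tokens.txt" is a placeholder name for this sketch.
tokens <- read.delim("tokens.txt", header = TRUE, stringsAsFactors = TRUE)
str(tokens)  # each coding tier should appear as its own column
```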
This assignment will not be marked. You are submitting the file so that everyone's files can be concatenated into one bigger token file, on which you will continue to work for Week 7.
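(The concatenation will be done for you, but for reference, combining token files in R is straightforward, provided the files share the same columns. A sketch with hypothetical file names:)

```r
# Combine two token files into one data frame (hypothetical file names).
# rbind() requires that the files have matching column names.
f1 <- read.delim("F27A_tokens.txt")
f2 <- read.delim("M18A_tokens.txt")
all.tokens <- rbind(f1, f2)
```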
At this point, we transfer the data from Excel to Rbrul, a package for conducting distributional and multivariate analysis in R. Rbrul is a statistical program created just for sociolinguists: it can provide counts (N), percentages (%), and factor weights (FW) showing how frequently the different variants of your dependent variable appear in various contexts. Although you could do counts and percentages in a spreadsheet program like Excel, Rbrul allows you to go one step further, to a multivariate analysis. This allows you to see how much effect each independent variable (aspects of the context: linguistic, social, and stylistic) has on the dependent variable (the phenomenon you're studying).
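If you have not used Rbrul before: per Rbrul's own documentation, you load it into R by sourcing it from its author's website and then calling its menu function (check danielezrajohnson.com if the URL has changed):

```r
# Load Rbrul into R and start its interactive text menu.
source("http://www.danielezrajohnson.com/Rbrul.R")
rbrul()  # follow the menu prompts to load your token file
```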
This is where you make sure that Rbrul properly understands the codes you've assigned to each token, and you choose how to sort/group your data. For starters, you do a general sort of all your data, to show the distribution of each dependent variant with respect to each variant of the independent variables.
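You will do this sort through Rbrul's menus, but the same distribution can be checked in plain R, which is a useful way to verify your coding. A sketch, assuming the dependent variable is in a column called r and one predictor is preceding.vowel (both names hypothetical):

```r
# Distribution of the dependent variable by one independent variable.
counts <- xtabs(~ preceding.vowel + r, data = tokens)
counts                                           # raw token counts (N)
round(100 * prop.table(counts, margin = 1), 1)   # row percentages (%)
```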
At the top, you see the factors that you put into the model, along with a measure of the significance (p-value) for each.
Next is a table for each independent variable (factor) in your model. As you can see from the column headers, the columns tell you:
1. each variant or "factor"
2. the logodds, or how much that variant favours the application value (in our case, [r])
3. the number of tokens in that category
4. the percentage of those tokens which have the application value (coded in this example as "1"), counted from the data. This is labeled as a fraction, with the application value before the slash and the sum of all values considered after the slash
5. the "centered factor weight", or the probability of a token having the application value in this context, calculated from the model
At the bottom of the table is important information about the model; see the Rbrul manual to interpret it.
Below the table, you see a list of all the variables considered for this model ("Current variables are:"). In a one-level analysis, this will be the same as the factors listed at the top. In a step-up/step-down analysis, it is possible that not all factors will be selected as significant: there, only significant factors are shown at the top, but all factors considered are listed at the bottom.
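One useful check: a centered factor weight is the centered logodds passed through the inverse logit function (see the Rbrul manual), so you can convert between items 2 and 5 above yourself:

```r
# Convert a centered logodds value to a factor weight (inverse logit).
logodds <- 0.55   # hypothetical value from a results table
plogis(logodds)   # = exp(logodds) / (1 + exp(logodds)), here ~0.63
```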
Linguistic variables | % [r-1] | N [r-1] | Total # of tokens in the category |
---|---|---|---|
Preceding vowel | | | |
(category) | 30 | 3 | 10 |
(category) | 40 | 4 | 10 |
(category) | 50 | 10 | 20 |
Following context | | | |
(category) | 40 | 4 | 10 |
(category) | 50 | 1 | 2 |
TOTAL | 45 | 135 | 300 |
Note: the numbers in these table templates are made up! You will have to replace them with your own numbers and your own categories, depending on what you find in your data.
Submit Tables 1 and 2. Use the same type of naming convention as above.
So far, you have been doing univariate analysis -- looking at only one independent variable at a time. In Part 9, you will conduct an analysis with several independent variables considered simultaneously. This is very important when your data set does not have a balanced distribution of every combination of every independent variable -- that is, when you are dealing with real-world data.
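You will run this through Rbrul's menus, but for the curious, the equivalent fixed-effects model in plain R is a binomial logistic regression (column names here are hypothetical, and the dependent variable must be a two-level factor or coded 0/1):

```r
# A binomial multivariate model: several independent variables at once.
model <- glm(r ~ preceding.vowel + following.context + speaker,
             family = binomial, data = tokens)
summary(model)  # one logodds estimate per factor level
```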
For this part of the project, you want to conduct an analysis that examines the tokens from all speakers together. You need to upload and use a token file that has the tokens for all six speakers in it.
Use [r-1] as the application value. Be sure to label clearly what is being counted in your tables.
In order to find out which variables have a significant effect, you must create a results file with no empty cells and no interacting factors. This may mean combining or deleting certain factors or factor groups. This process is done in the Adjust Data menu in Rbrul.
Make sure you have principled reasons for the changes you make. For example, it's ok to combine following stop and following fricative if there was 95% deletion for stops and 100% for fricatives, because (a) stops and fricatives are somewhat similar phonetically and (b) the numbers are fairly similar.
Practice recoding to combine and to delete certain factors. It's helpful to use the "Recode to new column" option in order to NOT overwrite the original values you coded.
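If you want to see what recoding to a new column amounts to in plain R, here is a sketch that collapses the hypothetical codes "c" and "d" into one category without touching the original column:

```r
# Recode into a NEW column so the original coding is preserved.
tokens$following2 <- as.character(tokens$following.context)
tokens$following2[tokens$following2 %in% c("c", "d")] <- "C"
tokens$following2 <- factor(tokens$following2)
table(tokens$following2)  # check the collapsed distribution
```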
See more information in the Rbrul manual that you bookmarked.
To look for interaction between 2 independent variables (meaning an overlap in how two different independent variables divide up your tokens), choose Cross-tab from the Modeling menu and select the 2 groups you are interested in. A new table will be created which shows you the distribution of the dependent variable according to both of these independent variables. You can copy the cross-tab output to an Excel (or Word) document to save and edit it. Use Courier font to make the columns line up.
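The same cross-tabulation can be reproduced in plain R if you want to double-check Rbrul's output (column names hypothetical):

```r
# Dependent variable broken down by two independent variables at once.
xtab <- xtabs(~ speaker + following.context + r, data = tokens)
ftable(xtab)  # flat layout that lines up in a monospaced font
```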
This analysis takes longer than the one-level analysis if you have a big token file.
It spits out a lot of text and numbers, but indicates which factors are significant.
You will notice that, generally, the factor groups it finds to be significant are those that have the biggest spread in values in the one-level analysis. If there is a lot of overlap between two different factor groups (e.g., all tokens with a following vowel having been produced by Pat), the two analyses may differ.
If you want to look at the constraints conditioning variation for only a subset of your tokens, for example, only tokens from younger speakers, you can use the Recode function in the Adjust Data menu to exclude tokens from the older speakers. Experiment with this.
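The plain-R equivalent of excluding tokens is subsetting. Assuming the combined file has a speaker column, a sketch (which speakers count as "younger" is just an example here):

```r
# Keep only tokens from the younger speakers (codes from the sample design).
young <- subset(tokens, speaker %in% c("F27A", "M18A", "M33A"))
nrow(young)  # how many tokens remain
```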
For Week 7, submit a document containing the Results file for a one-level binomial analysis for all speakers combined, as Table 3. Be sure to label all columns and rows clearly. It should be arranged like the factor weight tables we've seen in a number of articles this semester.
Updated December 10, 2014