LIN 351 2012 Winter
Sociolinguistics mini-project


Assignments for LIN 351 | Due date | What to turn in
Part 1: Define dependent variables | Jan. 26 | Description of 10 variables
Part 2: Get set up for the mini-project | Feb. 29 | Creating your token file in ELAN
Part 3: Mark (r) tokens in ELAN | |
Part 4: Code the independent linguistic variables in ELAN | |
Part 5: Code the independent social variables in ELAN and export to a .txt file | |
Part 6: Get Goldvarb started | Mar. 21 |
Part 7: Work on the distributional analysis | |
Part 8: Report distributional analyses | | Tables 1-2
Part 9: Calculate and report factor weights | Mar. 28 | Table 3
Part 10: Report on your comparative analysis | Apr. 4 |

Be sure to put your name and student number at the top of each assignment.

Part 1: Defining Dependent Variables

Part 2: Getting set up for the project

The purpose of the remaining assignments for this course is to:

This is what sociolinguists actually do, so you’ll get a chance to see how each step of the research process works. For this project, we will all work on the same sociolinguistic variable: the pronunciation of (r) in Boston English.

Step 1: Download the following paper from Blackboard. Click the “Books and Reading” button. Or link to it directly.

Irwin, P. & N. Nagy. 2007. Bostonians /r/ speaking: A quantitative look at (r) in Boston. Penn Working Papers in Linguistics 13.2 (Selected papers from NWAV 35). 135-47.

You will need to read this article in order to understand the issues involved in the study of the variable (r). Later in the term, you will read a follow-up article with more information:

Nagy, N. & P. Irwin. 2010. Boston (r): Neighbo(r)s nea(r) and fa(r). Language Variation and Change 22(2). 241-78.

A number of data files are required for this project. They are all located in the "Resources for the Mini-project" folder in "Assignments" in Blackboard.

Step 2: From Blackboard, download the six audio .wav files and the six .eaf files. (They may be combined as one .zip file.) Save the files together in a folder on a computer that you will be able to find and use all semester.

You must agree to the conditions of use of our data files before you may use them. To indicate your understanding of the conditions, print, read carefully, sign and submit the Corpus Data Use form with HW 1 or 2. No assignments will be accepted for credit unless this form has been signed.

The .eaf files are the transcribed recordings that will serve as your raw data files. There is no need to print these files. Rather, read through these instructions first, which include tips on how to deal with these data files. This document will guide you step-by-step through the assignment and the analysis.

THE SAMPLE DESIGN: You will be working with data from six speakers, all from Boston. Here is some information about them.

Table 1: The Speaker Sample
Speaker code

Step 3: Download the software program ELAN and install it on your computer. ELAN is freeware that runs on Mac, Windows and Linux. On this Download page you will also see links to the manual for ELAN, which you can read/download as a pdf or browse online. (Instructions below are geared toward the Mac OS X version, but it should work very similarly on other operating systems. The online manual is written more for Windows users.)



Part 3: Set up ELAN and mark 50 (r) tokens for each speaker

  1. Start ELAN.
  2. Find and label tokens of your dependent variable.

  3. Code your dependent variable.


  4. Follow the same procedure to mark and code at least 50 tokens of (r) for each of the 6 speakers.


    Part 4: Code the independent linguistic variables in ELAN

    For this Part, you will be working with the six ELAN .eaf files that you created in Part 3.

    Note: Capitalization matters when you are coding tokens! "e" will not be seen as the same thing as "E" when you start running your analyses.

    Code each of the tokens for two independent variables. To do this, highlight a token, then click in the relevant tier ("preceding vowel" or "following context"), right below the token. In the new field that appears, type in the 1-letter code that describes the appropriate context. Be sure to code what you hear, not what the spelling suggests -- this may vary across speakers. If you need to make any notes about questionable tokens, etc., type them in the "Default" tier so you can find them later.

    Categorize what you find in the data using the following coding scheme as a start. If you find a token that does not fit the existing categories, you can make up a new category and use it. Make sure to make a note of what your new abbreviation means and submit that with the assignment.

    Independent linguistic variable #1: Preceding vowel

    Code   IPA symbol   Description                      Example word
    i      i or ɪ       high front                       "beer"
    e      e or ɛ       mid front                        "bear"
    a      ɑ, a or æ    low                              "bar" (for some speakers, some tokens)
    o      o or ɔ       mid back                         "bore"
    u      u or ʊ       high back round                  "boor" (or "Bloor")
    2      ə            unstressed mid central (schwa)   "runner"
    3      ʌ            stressed mid central (caret)     "purchase"
    x      j            glides or other sounds           "your" as [jr] or "p'ticipate"

    Independent linguistic variable #2: Following context

    Code   Description                                                          Example
    v      Word-final, preceding a vowel                                        "car is"
    p      Word-final, preceding a pause                                        "car."
    c      Following consonant, in the next morpheme and in the next syllable   "wintertime"
    d      Following consonant, in the next morpheme but the same syllable      "winters"
    s      Morpheme-internal (following consonant, in the same morpheme)        "card"

    You can find more examples for each variant in the reading.
    Your file should now look something like this.


    Part 5: Code the independent social variables in ELAN and export to a .txt file

    The final step in coding your tokens is to add a single annotation in the "social factor codes" tier. This annotation must be the length of the entire recording. It should contain a 3-letter code that describes the speaker: age group + sex + ethnicity. Refer back to Table 1: The Speaker Sample.

    Your ELAN file should now look like this.

    Once you have coded all six files, check your coding for accuracy and consistency.

    The next step is to export your coded data, along with the transcription and timestamps, to a .txt file for statistical analysis.

    To export a file:

    1. In ELAN, choose Export as > Tab-delimited Text from the File menu. (If you are brave, you can experiment with the Export Multiple Files As function.)
    2. In the Select tiers box, click main speaker, tokens, and the tiers for all the variables you coded (dependent and independent, linguistic and social).
    3. In Output options, click "Separate column for each tier" and "Repeat values..."
    4. In Include time column for, click "Begin Time" and "End Time."
    5. In Include time format: click on the first box.
    6. Click OK.
    7. Name the file SPEAKERCODE_YOURLASTNAME_YOURFIRSTNAME.txt when you save it.
    8. Follow the same process for all six speakers.

    To prepare your data to turn in:

    1. Open each .txt file (one for each speaker) in Excel.
    2. Paste them, one below the other, into one Excel file. Make sure the same kind of information appears consistently in each column, for all the speakers.
    3. Save this new Excel file as YOURLASTNAME_YOURFIRSTNAME_LIN351_tokens.xls.
    4. Select All.
    5. From the Data menu, choose Sort.... Sort by "social factor codes" and then by "Begin Time."
    6. After sorting, please delete all the rows that do NOT contain tokens. (They will have nothing in the "Begin Time" or "Dependent variable" columns, so it's easy to select them all and delete in one action.)
    7. Save As... YOURLASTNAME_YOURFIRSTNAME_LIN351_tokens.txt. (This is a tab-delimited text file format.)
    8. Submit this .txt file electronically in the Assignments section of Blackboard.
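If you would rather script the merge than do it by hand in Excel, the steps above can be sketched in Python. This is a sketch under assumptions: the column headers ("Begin Time", "dependent variable", "social factor codes") are the ones chosen in the export step above, and the file paths are up to you.

```python
import csv

def read_export(path):
    """Read one tab-delimited ELAN export (one file per speaker)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t"))

def merge_exports(tables):
    """Combine the per-speaker tables, keep only rows that are real tokens
    (step 6 above: rows with a Begin Time and a dependent-variable value),
    and sort by speaker code, then by time (step 5 above)."""
    rows = [r for table in tables for r in table
            if r.get("Begin Time") and r.get("dependent variable")]
    rows.sort(key=lambda r: (r["social factor codes"], r["Begin Time"]))
    return rows
```

Writing the merged rows back out with `csv.DictWriter(f, fieldnames=..., delimiter="\t")` produces the tab-delimited .txt file described in step 7.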

    Submit your signed Corpus Data Release Form in tutorial the day this is due, if you did not do so previously.
    No credit will be given to any student who has not submitted this form. You will find the form in Blackboard.


    Part 6: Getting Goldvarb started

    At this point, we transfer the data from Excel to Goldvarb. We will use Goldvarb to conduct our analysis. It’s a statistical program created just for sociolinguists. It can provide counts (N), percentages (%) and factor weights showing how frequently the different variants of your dependent variable appear in various contexts. Although you could do counts and percentages in a spreadsheet program like Excel, Goldvarb allows you to go one step further to a multivariate analysis, which lets you see how much effect each independent variable (aspects of the context: linguistic, social, and stylistic) has on the dependent variable (the phenomenon you’re studying).

      Download Goldvarb. It is available for Mac, Windows, or Linux. See link in Blackboard or download directly from:

      Goldvarb requires the code for each token to be enclosed between a left parenthesis “(“ and some white space. So, to easily create a token file that is formatted appropriately, add the following formula to the first blank cell to the right of your first coded token in your Excel spreadsheet:

      ="("&E2&F2&G2&H2&"   "&D2&"   "&C2

      (Each of the two quoted strings in the formula contains three spaces, even if the quotation marks look adjacent on a webpage.) Be sure to type this formula into your spreadsheet rather than copying it from the webpage; apparently there are issues with the quotation marks otherwise.

      Your spreadsheet should now look like this:

        A           B                                C       D                   E                F                  G                    H                                                  I
      1 Begin Time  main speaker                     tokens  dependent variable  preceding vowel  following context  social factor codes  Token (as produced by the formula in Column I)     Token formula
      2 00:00.8     the following words are from...  words   r                   2                c                  YFW                  (r2cYFW   words   the following words are from...  ="("&D2&E2&F2&G2&"   "&C2&"   "&B2
      3 00:00.9     the following words are from...  are     r                   a                v                  YFW                  (ravYFW   are   the following words are from...

      Note: Cell I2 shows the formula that you should type in. The cells in Column H show what it will produce in this table. (Type it once into Cell I2 and Fill Down for the rest of the column.)

      If your columns are in a different order, either re-order them or change the formula accordingly.

      This will create an appropriately formatted token in the cell where you typed this function.

      Copy this formula down the full column so that you have each token formatted this way.
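For readers more comfortable with a script than a spreadsheet, the effect of the formula can be sketched in Python. The function name and argument names are ours, not part of Goldvarb; the column meanings follow the export from Part 5.

```python
def goldvarb_token(dep, prec, foll, social, word, context):
    """Build one Goldvarb token line: "(" followed immediately by the coded
    elements, then the word and its transcription context, each set off by
    three spaces (Goldvarb reads the code up to the first white space)."""
    return "(" + dep + prec + foll + social + "   " + word + "   " + context
```

For example, `goldvarb_token("r", "2", "c", "YFW", "words", "the following words are from...")` reproduces the Column H cell shown in the table above.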

      Copy only this column (here, Column H) into a New Token file in Goldvarb. To do this:

    Prepare the token file

    1. Open Goldvarb.
    2. Select "New" in the "File" menu.
    3. Add ".tkn" as a suffix to whatever title you choose.
    4. Select "Tokens" as the type of new file.
    5. When it asks you to select number of groups, type the number of elements you will have in each token (7). Then hit return or "OK". You can always change the number of elements, so just click "OK" and go on if you aren't sure.
    6. Enter your tokens, one per line. The easiest way to do this is to paste in the one column of formatted tokens from your Excel spreadsheet (refer back to Part 6).

    In general:

    FYI, there is a "search & replace" command in the "Tokens" menu which you can use to automate repetitive tasks. For example, assume every line needs to have "N" as the second element of the token (representing the speaker "Naomi"), and you have used "y" and "n" as the first element of your token. Select "search & replace" and replace each "y" with "yN" and then each "n" with "nN".

    FYI, you can cut and paste to and from a token file for editing, just like in Word.

    Save the token file. Name it LIN351_Last-Name_First-Name.tkn.

    Check your token file.

    1. In the Token menu, choose "Generate factor specifications." This will make a list of all the characters you used in each column. You can see these in the little window below your token file. Click through the factor groups and make sure there are no anomalous characters (likely indicating typos or missing elements in a token).
    2. Alternatively, you can enter the factor specifications into the "Factor specification" window and then select "Check tokens"; the program will look for lines that have an anomalous character (i.e., one that you didn't specify for that group).
    3. If it asks you to "Set fill character," just type "/" and say OK. That means that if you have any token string that isn't long enough (you specified the length at the beginning with "Select number of groups"), it will fill out the line with "/" characters; then you'll see them and can check what you missed.
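A rough scripted equivalent of the "Check tokens" pass, as a sketch (the function is ours, not part of Goldvarb; the group count is whatever you entered at "Select number of groups"):

```python
def find_bad_tokens(lines, n_groups):
    """Return (line number, line) pairs whose code is not "(" followed by
    exactly n_groups coded characters -- the same kind of anomalies that
    "Check tokens" flags."""
    bad = []
    for i, line in enumerate(lines, start=1):
        parts = line.split()
        code = parts[0] if parts else ""
        if not code.startswith("(") or len(code) - 1 != n_groups:
            bad.append((i, line))
    return bad
```

An empty return value means no line had a missing parenthesis or a wrong number of coded elements.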


    Part 7: Distributional analysis using Goldvarb

    First, make sure your Token file (from Part 6) is open in Goldvarb.

    Prepare the Conditions file

    This is where you choose how to sort your data. For starters, you do a general sort of all your data, to show the distribution of the dependent variable with respect to each variant of each independent variable.

    1. In the Token menu, choose No recode. This will give you a general overview of the tokens and patterns you have.
    2. Save your new Conditions file (with .cnd suffix).
    3. Name it something that will make sense like "all" or "first" or "Vanessa."
      Add the suffix ".cnd" to the name, if it doesn't automatically appear.
      Note that the program suggests an appropriate name, but it will always suggest a name that is the same as your token file name, and you will probably make several different condition files from your token file.
    4. Watch it generate a list of conditions, written in the Lisp programming language. Your condition file will look something like this:

       (
       ; Identity recode: All groups included as is.

    This means: "Didn't do anything to any of the groups. Just use them all (elements (1), (2) and (3) of each token) as they are, with the stuff in column (1) as the dependent variable (because it appears first)."

    Create cell and result files

    1. In the Cell menu select Load cells to memory.
    2. Click "OK" when it asks whether to use the tokens and condition files that you see on the screen.
      This won't work if there is anything wrong with your token file or conditions file.
      A cell file is created, which you can ignore.
      A Results file is also created.
    3. You will be asked to select the application values. This means, "how will the dependent variable be examined?" The possible variants of the dependent variable are listed. You can rearrange their order and/or erase some of them. Assume the window shows you the string "YN?".
    To go on to the statistical analysis, you must select one of the binary comparison options (listing either one or two variants).
    Save the Results file under an appropriate name (LIN351_Last-Name_First-Name_1.Res).

    The Results file

    This file shows you the distribution of the dependent variable with respect to each variant of the independent variables. It looks like this:

    The .Res file in Goldvarb shows:
    - when you created it
    - the .tkn and .cnd files used
    - the conditions you selected
    - a summary of the distribution
    - details of the distribution of variants of the dependent variable (listed across the top) for each independent variable (listed along the left side)

    This table lists the possible variants of the dependent variable (Y and N) across the top of the table and the possible variants of each independent variable, one per row. So this Results file compares tokens with "Y" vs. "N" as the dependent variable, i.e., deleted vs. non-deleted (t,d). Note: You will not use “Y” and “N” for your dependent variable.
    The first independent variable examined is following segment, with "C" for following consonant and "V" for following vowel. It shows that 67% of the words with a following consonant had deleted (t,d), but 0% of the words with a following vowel had deleted (t,d).
    The second independent variable examined is Speaker, with "P" for Pat and "N" for Naomi (pseudonyms, of course). We see that Speaker N deleted (t,d) in 33% of her tokens and Speaker P in 100%.
    The word * KnockOut * appears in every line that has a "0" value in it.
    Finally, we see that overall, for the whole token set, (t,d) deletion occurred 50% of the time, or in 2 out of 4 tokens.
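The counts in such a Results file can be reproduced by hand. As a sketch, here is the same tally in Python, on token strings reconstructed to match the percentages reported above (the strings themselves are made up; the code order is dependent variable, then following segment, then speaker):

```python
from collections import Counter

def distribution(tokens, group_index):
    """Tally variants of the dependent variable (character 0 of the code)
    within each variant of the factor group at group_index.
    Returns {factor: {variant: (count, percent)}}."""
    cells = {}
    for t in tokens:
        code = t.split()[0].lstrip("(")
        cells.setdefault(code[group_index], Counter())[code[0]] += 1
    return {f: {v: (n, round(100 * n / sum(c.values())))
                for v, n in c.items()}
            for f, c in cells.items()}

# Four made-up (t,d) tokens consistent with the figures quoted above:
toks = ["(YCP word1", "(YCN word2", "(NCN word3", "(NVN word4"]
```

`distribution(toks, 1)` gives 67% "Y" after a consonant and no "Y" after a vowel; `distribution(toks, 2)` gives 33% "Y" for Speaker N and 100% for Speaker P, matching the Results file described above.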

    Note: You can copy and paste this table into a Word document, such as a research paper. The columns will line up if you choose the Courier font. It will have strings of spaces rather than tabs, which can be a pain to edit, but you can fix it up as necessary. You can also edit it right in the Results window, which can be dangerous. But, if you do mess it up, you can also reconstruct a new results file by going back to the Cell menu and selecting Load cells to memory.


    Part 8: Report the overall distribution of the dependent variable in the data

    1. Provide an overall distribution of the dependent variable in the data, as per the following template. Format and label it as in this example:

       Table 1: Overall distribution of (r): Linguistic variables (6 speakers)

       Linguistic variables                                         % [r-1]   N [r-1]   Total # of tokens in the category
       Preceding vowel
         high front
         mid front
         ...
       Following context
         Following C, in the next morpheme but the same syllable
         Following V
         ...

    2. Create a second table, similar to Table 1, that provides a distributional analysis for the social variables.

    Note: the numbers in these table templates are made up! You will have to replace them with your own numbers and your own categories, depending on what you find in your data.

    Submit Tables 1 and 2 as well as a print-out of your Goldvarb .Res file (for Part 8). Use the same type of naming convention as for the previous assignment.

    Note: If you don't have a printer and GoldVarb on the same computer, you may need to COPY the contents of your Results file to a Word .doc or some such, save it, and take/send it to a computer with a printer. (Because you won't be able to easily open a .Res file on a computer where Goldvarb isn't installed.)

    So far, you have been doing univariate analysis – looking at only one independent variable at a time. In the next Part, you will learn how to conduct an analysis with more than one independent variable included. This is very important when your data set does not have a balanced distribution of every combination of every independent variable – that is, when you are dealing with real-world data.


    Part 9: Factor Weights

    Preparing for multivariate analysis

    For this part of the project, you want to conduct an analysis that examines the tokens from all speakers together. You need to use a token file that has the tokens for all six speakers in it.

    Use [r-1] as the application value. Be sure to properly label what is being counted, in your tables.

    In order to find out which variables are significant, you must create a results file with no "Knockouts," i.e., no "0" values. This may mean combining or deleting certain factors or factor groups. You do this by creating a new Conditions file.

    Make sure you have principled reasons for the changes you make. For example, it's OK to combine following stop and following fricative if there was 95% deletion for stops and 100% for fricatives, because (a) stops and fricatives are somewhat similar phonetically and (b) the numbers were fairly similar.

    Select Recode setup from the Tokens menu. First, copy over the dependent variable from the left to the right side, using the Copy button. Then, for any factor groups that you wish to leave intact, select them on the left side (clicking on their factor group number) and then click Copy.

    To exclude a certain factor, click on it on the left side, then click Exclude and say OK. Then copy the factor group. Although the excluded factor will still show up, tokens containing it will be ignored in the analysis.

    To recode a factor group (normally to combine two categories that were coded separately), select it, choose recode, and then type over the letters on the right to show how you want to recode them.

    Make sure it worked. Do this by going back to the Create cells step and creating the distribution tables again and making sure that there are no knock-outs. If it didn't work, try a new condition file. This may get tedious, so you might copy down your coding from the Conditions window. You should be able to figure out most of the Lisp code.
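What Recode setup writes into the Conditions file amounts to a character-for-character mapping over the token codes. A minimal sketch (the codes "t", "f", and "O" are hypothetical, standing for following stop, following fricative, and a combined obstruent category):

```python
def recode_group(code, group_index, mapping):
    """Apply a recode to one token code string: replace the character for
    the given factor group according to mapping; characters not listed in
    mapping are left intact."""
    chars = list(code)
    chars[group_index] = mapping.get(chars[group_index], chars[group_index])
    return "".join(chars)

# Hypothetical recode combining two near-categorical factors, as in the
# stop/fricative example above:
COMBINE_OBSTRUENTS = {"t": "O", "f": "O"}
```

Here `recode_group("Yt", 1, COMBINE_OBSTRUENTS)` and `recode_group("Yf", 1, COMBINE_OBSTRUENTS)` both yield "YO", collapsing the two categories into one and eliminating the knockout.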

    See more information in the "How to recode" document in Bb (or download here; you'll need the sample .tkn file, too).

    Multivariate analysis

    After producing the usual (distributional) results by selecting Load cells to Memory, choose Binomial, one-level from the Cells menu.

    This will create a table showing the Factor Weight of each factor, in addition to the percentages (App/Total). It looks pretty much like all the tables of weights and probabilities you've seen in various articles, but to get the number of tokens, you need to scroll back up to your distributional results. The weights are the values for p1, p2, etc., in the logistic equations we looked at, representing the effect that each factor has on whether the rule applies. (The factor weights are for the "Application value," which is whichever value of the dependent variable prints in the first column of the distribution table that you will have just made.)
    It will also show the frequency (percentage) for each factor, for that same Application value. (We will ignore the Input&Weight column.)

    This report also gives an Input value, which is the overall probability (po) of the application value occurring. That should always be reported.

    So for any one combination of factors (e.g., following consonant, Pat as speaker) we could calculate the probability of deletion by combining the po value with the appropriate p1, p2, etc., for each factor group. But we don't have to do it by hand, because Goldvarb does it for us -- those weights combined will equal the probability of a certain type of token undergoing the rule application.
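For the curious, the combination just described is the standard variable-rule formula: the input probability po and one weight per factor group are multiplied together on the odds scale. A sketch, with made-up weights:

```python
def combined_probability(p0, weights):
    """Variable-rule combination:
    p = (p0*p1*...*pn) / (p0*p1*...*pn + (1-p0)*(1-p1)*...*(1-pn)),
    where each pi is the weight of the factor present in the token."""
    num, den = p0, 1 - p0
    for w in weights:
        num *= w
        den *= 1 - w
    return num / (num + den)
```

A weight of 0.5 leaves the probability unchanged; weights above 0.5 favor the application value and weights below 0.5 disfavor it. For example, with input 0.5 and a single factor weight of 0.8, the combined probability is 0.8.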

    If you want to look at the constraints conditioning variation for only a subset of your tokens, for example, only tokens from younger speakers, you can use the Recode set-up to exclude tokens from the older speakers. Experiment with this.

    Choose Binomial, up & down for analysis of which factors are significant. (You don't need to do this step for the assignment. It's more relevant when you have more factors.)

    This analysis takes longer than the one-level, if you have a big token file.
    It spits out a lot of text and numbers, but indicates which factors are significant.
    You will notice that, generally, the factor groups it finds to be significant are those that have the biggest spread in values in the one-level analysis. If there is a lot of overlap between two different factor groups (e.g., if all tokens with a following vowel were produced by Pat), there may be differences.


    If you want to look for that type of overlap, or any interaction between 2 independent variables, choose Cross-tab from the Cells menu and select the 2 groups you are interested in. A new table will be created which shows you their distribution. You can save it as Text or Picture. The "Picture" one looks nicer when copied into another document, but can't be edited, and takes up more disk space. The "Text" one can be edited, and has to be, in order to be legible. Use Courier font to make the columns line up.
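The cross-tab itself is just a two-way count. A sketch, using the same made-up (t,d) token format as earlier (group 1 = following segment, group 2 = speaker):

```python
from collections import Counter

def cross_tab(tokens, g1, g2):
    """Count tokens for every combination of two factor groups."""
    counts = Counter()
    for t in tokens:
        code = t.split()[0].lstrip("(")
        counts[(code[g1], code[g2])] += 1
    return counts
```

A cell with zero tokens for some combination is exactly the kind of overlap that makes two factor groups hard to disentangle in the up & down analysis.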

    For Part 9, submit a document containing the Results file for a one-level binomial analysis for all speakers combined, as Table 3. Be sure to label all columns and rows clearly. It should be arranged like Factor Weight tables we've seen in a number of articles this semester.

    See the note above about printing issues.


    Part 10: Comparative analysis

    Submit a brief report (1-2 pages of prose) answering the following questions. Include the necessary tables from your own analysis, clearly labeled.

    Compare your results with the results presented in Nagy & Irwin (2010). What’s the same? What’s different?

    As you do this, think about things like:

    NOTE: The various files shown as examples are nonsense, un-related, and/or not created from real data.


    Updated March 26, 2012