LIN 351 2012 Winter
Sociolinguistics mini-project


Assignments for LIN 351 | Due date | What to turn in
Part 1: Define dependent variables | Jan. 26 | Description of 10 variables
Part 2: Get set up for the mini-project | Feb. 29 | Creating your token file in ELAN
Part 3: Mark (r) tokens in ELAN | |
Part 4: Code the independent linguistic variables in ELAN | |
Part 5: Code the independent social variables in ELAN and export to a .txt file | |
Part 6: Get Goldvarb started | Mar. 21 |
Part 7: Work on the distributional analysis | |
Part 8: Report distributional analyses | | Tables 1-2
Part 9: Calculate and report factor weights | Mar. 28 | Table 3
Part 10: Report on your comparative analysis | Apr. 4 |

Be sure to put your name and student number at the top of each assignment.

Part 1: Defining Dependent Variables

Part 2: Getting set up for the project

The purpose of the remaining assignments for this course is to:

This is what sociolinguists actually do, so you’ll get a chance to see how each step of the research process works. For this project, we will all work on the same sociolinguistic variable: the pronunciation of (r) in Boston English.

Step 1: Download the following paper from Blackboard. Click the “Books and Reading” button. Or link to it directly.

Irwin, P. & N. Nagy. 2007. Bostonians /r/ speaking: A quantitative look at (r) in Boston. Penn Working Papers in Linguistics 13.2 (Selected papers from NWAV 35). 135-47.

You will need to read this article in order to understand the issues involved in the study of the variable (r). Later in the term, you will read a follow-up article with more information:

Nagy, N. & P. Irwin. 2010. Boston (r): Neighbo(r)s nea(r) and fa(r). Language Variation and Change 22(2). 241-78.

A number of data files are required for this project. They are all located in the "Resources for the Mini-project" folder in "Assignments" in Blackboard.

Step 2: From Blackboard, download the six audio .wav files and the six .eaf files. (They may be combined as one .zip file.) Save the files together in a folder on a computer that you will be able to find and use all semester.

You must agree to the conditions of use of our data files before you may use them. To indicate your understanding of the conditions, print, read carefully, sign and submit the Corpus Data Use form with HW 1 or 2. No assignments will be accepted for credit unless this form has been signed.

The .eaf files are the transcribed recordings that will serve as your raw data files. There is no need to print these files. Rather, read through these instructions first, which include tips on how to deal with these data files. This document will guide you step-by-step through the assignment and the analysis.

THE SAMPLE DESIGN: You will be working with data from six speakers, all from Boston. Here is some information about them.

Table 1: The Speaker Sample
Speaker code

Step 3: Download the software program ELAN and install it on your computer. ELAN is freeware that runs on Mac, Windows and Linux. On this Download page you will also see links to the manual for ELAN, which you can read/download as a pdf or browse online. (Instructions below are geared toward the Mac OS X version, but it should work very similarly on other operating systems. The online manual is written more for Windows users.)



Part 3: Set up ELAN and mark 50 (r) tokens for each speaker

  1. Start ELAN.
  2. Find and label tokens of your dependent variable.

  3. Code your dependent variable.


  4. Follow the same procedure to mark and code at least 50 tokens of (r) for each of the 6 speakers.


    Part 4: Code the independent linguistic variables in ELAN

    For this Part, you will be working with the six ELAN .eaf files that you created in Part 3.

    Note: Capitalization matters when you are coding tokens! "e" will not be seen as the same thing as "E" when you start running your analyses.

    Code each of the tokens for two independent variables. To do this, highlight a token, then click in the relevant tier ("preceding vowel" or "following context"), right below the token. In the new field that appears, type in the 1-letter code that describes the appropriate context. Be sure to code what you hear, not what the spelling suggests -- this may vary across speakers. If you need to make any notes about questionable tokens, etc., type them in the "Default" tier so you can find them later.

    Categorize what you find in the data using the following coding scheme as a start. If you find a token that does not fit the existing categories, you can make up a new category and use it. Make sure to make a note of what your new abbreviation means and submit that with the assignment.

    Independent linguistic variable #1: Preceding vowel

    Code   IPA symbol   Description                      Example word
    i      i or ɪ       high front                       "beer"
    e      e or ɛ       mid front                        "bear"
    a      ɑ, a or æ    low                              "bar" (for some speakers, some tokens)
    o      o or ɔ       mid back                         "bore"
    u      u or ʊ       high back round                  "boor" (or "Bloor")
    2      ə            unstressed mid central (schwa)   "runner"
    3      ʌ            stressed mid central (caret)     "purchase"
    x      j            glides or other sounds           "your" as [jr] or "p'ticipate"

    Independent linguistic variable #2: Following context

    Code   Description                                                          Example
    v      Word-final, preceding a vowel                                        "car is"
    p      Word-final, preceding a pause                                        "car."
    c      Following consonant, in the next morpheme and in the next syllable   "wintertime"
    d      Following consonant, in the next morpheme but the same syllable      "winters"
    s      Morpheme-internal (following consonant, in the same morpheme)        "card"

    You can find more examples for each variant in the reading.
    Your file should now look something like this.


    Part 5: Code the independent social variables in ELAN and export to a .txt file

    The final step in coding your tokens is to add a single annotation in the "social factor codes" tier. This annotation must be the length of the entire recording. It should contain a 3-letter code that describes the speaker: age group + sex + ethnicity. Refer back to Table 1: The Speaker Sample.

    Your ELAN file should now look like this.

    Once you have coded all six files, check your coding for accuracy and consistency.

    The next step is to export your coded data, along with the transcription and timestamps, to a .txt file for statistical analysis.

    To export a file:

    1. In ELAN, choose Export as > Tab-delimited Text from the File menu. (If you are brave, you can experiment with the Export Multiple Files As function.)
    2. In the Select tiers box, click main speaker, tokens, and the tiers for all the variables you coded (dependent and independent, linguistic and social).
    3. In Output options, click "Separate column for each tier" and "Repeat values..."
    4. In Include time column for, click "Begin Time" and "End Time."
    5. In Include time format: click on the first box.
    6. Click OK.
    7. Name the file SPEAKERCODE_YOURLASTNAME_YOURFIRSTNAME.txt when you save it.
    8. Follow the same process for all six speakers.

    To prepare your data to turn in:

    1. Open each .txt file (one for each speaker) in Excel.
    2. Paste them, one below the other, into one Excel file. Make sure the same kind of information appears consistently in each column, for all the speakers.
    3. Save this new Excel file as YOURLASTNAME_YOURFIRSTNAME_LIN351_tokens.xls.
    4. Select All.
    5. From the Data menu, choose Sort.... Sort by "social factor codes" and then by "Begin Time."
    6. After sorting, please delete all the rows that do NOT contain tokens. (They will have nothing in the "Begin Time" or "Dependent variable" columns, so it's easy to select them all and delete in one action.)
    7. Save As... YOURLASTNAME_YOURFIRSTNAME_LIN351_tokens.txt. (This is a tab-delimited text file format.)
    8. Submit this .txt file electronically in the Assignments section of Blackboard.
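If you would rather script the merge than do it by hand in Excel, the steps above can be sketched in Python. This is a sketch under assumptions: the column headers ("Begin Time", "dependent variable", "social factor codes") are the ones chosen in the export step above, and the file paths are up to you.

```python
import csv

def read_export(path):
    """Read one tab-delimited ELAN export (one file per speaker)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t"))

def merge_exports(tables):
    """Combine the per-speaker tables, keep only rows that are real tokens
    (step 6 above: rows with a Begin Time and a dependent-variable value),
    and sort by speaker code, then by time (step 5 above)."""
    rows = [r for table in tables for r in table
            if r.get("Begin Time") and r.get("dependent variable")]
    rows.sort(key=lambda r: (r["social factor codes"], r["Begin Time"]))
    return rows
```

Writing the merged rows back out with `csv.DictWriter(f, fieldnames=..., delimiter="\t")` produces the tab-delimited .txt file described in step 7.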

    Submit your signed Corpus Data Release Form in tutorial the day this is due, if you did not do so previously.
    No credit will be given to any student who has not submitted this form. You will find the form in Blackboard.


    Part 6: Getting Goldvarb started

    At this point, we transfer the data from Excel to Goldvarb. We will use Goldvarb to conduct our analysis. It’s a statistical program created just for sociolinguists. It can provide counts (N), percentages (%) and factor weights showing how frequently the different variants of your dependent variable appear in various contexts. Although you could do counts and percentages in a spreadsheet program like Excel, Goldvarb allows you to go one step further to a multivariate analysis, which lets you see how much effect each independent variable (aspects of the context: linguistic, social, and stylistic) has on the dependent variable (the phenomenon you’re studying).

      Download Goldvarb. It is available for Mac, Windows, or Linux. See link in Blackboard or download directly from:

      Goldvarb requires the code for each token to be enclosed between a left parenthesis “(“ and some white space. So, to easily create a token file that is formatted appropriately, add the following formula to the first blank cell to the right of your first coded token in your Excel spreadsheet:

      ="("&E2&F2&G2&H2&"   "&D2&"   "&C2

      (Each of the two quoted strings in the formula contains three spaces, even if the quotation marks look adjacent on a webpage.) Be sure to type this formula into your spreadsheet rather than copying it from the webpage; apparently there are issues with the quotation marks otherwise.

      Your spreadsheet should now look like this:

        A           B                                C       D                   E                F                  G                    H                                                  I
      1 Begin Time  main speaker                     tokens  dependent variable  preceding vowel  following context  social factor codes  Token (as produced by the formula in Column I)     Token formula
      2 00:00.8     the following words are from...  words   r                   2                c                  YFW                  (r2cYFW   words   the following words are from...  ="("&D2&E2&F2&G2&"   "&C2&"   "&B2
      3 00:00.9     the following words are from...  are     r                   a                v                  YFW                  (ravYFW   are   the following words are from...

      Note: Cell I2 shows the formula that you should type in. The cells in Column H show what it will produce in this table. (Type it once into Cell I2 and Fill Down for the rest of the column.)

      If your columns are in a different order, either re-order them or change the formula accordingly.

      This will create an appropriately formatted token in the cell where you typed this function.

      Copy this formula down the full column so that you have each token formatted this way.
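For readers more comfortable with a script than a spreadsheet, the effect of the formula can be sketched in Python. The function name and argument names are ours, not part of Goldvarb; the column meanings follow the export from Part 5.

```python
def goldvarb_token(dep, prec, foll, social, word, context):
    """Build one Goldvarb token line: "(" followed immediately by the coded
    elements, then the word and its transcription context, each set off by
    three spaces (Goldvarb reads the code up to the first white space)."""
    return "(" + dep + prec + foll + social + "   " + word + "   " + context
```

For example, `goldvarb_token("r", "2", "c", "YFW", "words", "the following words are from...")` reproduces the Column H cell shown in the table above.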

      Copy only this column (here, Column H) into a New Token file in Goldvarb. To do this:

    Prepare the token file

    1. Open Goldvarb.
    2. Select "New" in the "File" menu.
    3. Add ".tkn" as a suffix to whatever title you choose.
    4. Select "Tokens" as the type of new file.
    5. When it asks you to select number of groups, type the number of elements you will have in each token (7). Then hit return or "OK". You can always change the number of elements, so just click "OK" and go on if you aren't sure.
    6. Enter your tokens, one per line. The easiest way to do this is to paste in the one column of formatted tokens from your Excel spreadsheet (refer back to Part 6).

    In general:

    FYI, there is a "search & replace" command in the "Tokens" menu which you can use to automate repetitive tasks. For example, assume every line needs to have "N" as the second element of the token (representing the speaker "Naomi"), and you have used "y" and "n" as the first element of your token. Select "search & replace" and replace each "y" with "yN" and then each "n" with "nN".

    FYI, you can cut and paste to and from a token file for editing, just like in Word.

    Save the token file. Name it LIN351_Last-Name_First-Name.tkn.

    Check your token file.

    1. In the Token menu, choose "Generate factor specifications." This will make a list of all the characters you used in each column. You can see these in the little window below your token file. Click through the factor groups and make sure there are no anomalous characters (likely indicating typos or missing elements in a token).
    2. Alternatively, you can enter the factor specifications into the "Factor specification" window and then select "Check tokens"; the program will look for lines that have an anomalous character (i.e., one that you didn't specify for that group).
    3. If it asks you to "Set fill character," just type "/" and say OK. That means that if you have any token string that isn't long enough (you specified the length at the beginning with "Select number of groups"), it will fill out the line with "/" characters; then you'll see them and can check what you missed.
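A rough scripted equivalent of the "Check tokens" pass, as a sketch (the function is ours, not part of Goldvarb; the group count is whatever you entered at "Select number of groups"):

```python
def find_bad_tokens(lines, n_groups):
    """Return (line number, line) pairs whose code is not "(" followed by
    exactly n_groups coded characters -- the same kind of anomalies that
    "Check tokens" flags."""
    bad = []
    for i, line in enumerate(lines, start=1):
        parts = line.split()
        code = parts[0] if parts else ""
        if not code.startswith("(") or len(code) - 1 != n_groups:
            bad.append((i, line))
    return bad
```

An empty return value means no line had a missing parenthesis or a wrong number of coded elements.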


    Part 7: Distributional analysis using Goldvarb

    First, make sure your Token file (from Part 6) is open in Goldvarb.

    Prepare the Conditions file

    This is where you choose how to sort your data. For starters, you do a general sort of all your data, to show the distribution of the dependent variable with respect to each variant of each independent variable.

    1. In the Token menu, choose No recode. This will give you a general overview of the tokens and patterns you have.
    2. Save your new Conditions file (with .cnd suffix).
    3. Name it something that will make sense like "all" or "first" or "Vanessa."
      Add the suffix ".cnd" to the name, if it doesn't automatically appear.
      Note that the program suggests an appropriate name, but it will always suggest a name that is the same as your token file name, and you will probably make several different condition files from your token file.
    4. Watch it generate a list of conditions, written in the Lisp programming language. Your condition file will look something like this:

       (
       ; Identity recode: All groups included as is.

    This means: "Didn't do anything to any of the groups. Just use them all (elements (1), (2) and (3) of each token) as they are, with the stuff in column (1) as the dependent variable (because it appears first)."

    Create cell and result files

    1. In the Cell menu select Load cells to memory.
    2. Click "OK" when it asks whether to use the tokens and condition files that you see on the screen.
      This won't work if there is anything wrong with your token file or conditions file.
      A cell file is created, which you can ignore.
      A Results file is also created.
    3. You will be asked to select the application values. This means, "how will the dependent variable be examined?" The possible variants of the dependent variable are listed. You can rearrange their order and/or erase some of them. Assume the window shows you the string "YN?".
    To go on to the statistical analysis, you must select one of the binary comparison options (listing either one or two variants).
    Save the Results file under an appropriate name (LIN351_Last-Name_First-Name_1.Res).

    The Results file

    This file shows you the distribution of the dependent variable with respect to each variant of the independent variables. It looks like this:

    The .Res file in Goldvarb shows:
    - when you created it
    - the .tkn and .cnd files used
    - the conditions you selected
    - a summary of the distribution
    - details of the distribution of variants of the dependent variable (listed across the top) for each independent variable (listed along the left side)

    This table lists the possible variants of the dependent variable (Y and N) across the top of the table and the possible variants of each independent variable, one per row. So this Results file compares tokens with "Y" vs. "N" as the dependent variable, i.e., deleted vs. non-deleted (t,d). Note: You will not use “Y” and “N” for your dependent variable.
    The first independent variable examined is following segment, with "C" for following consonant and "V" for following vowel. It shows that 67% of the words with a following consonant had deleted (t,d), but 0% of the words with a following vowel had deleted (t,d).
    The second independent variable examined is Speaker, with "P" for Pat and "N" for Naomi (pseudonyms, of course). We see that Speaker N deleted (t,d) in 33% of her tokens and Speaker P in 100%.
    The word * KnockOut * appears in every line that has a "0" value in it.
    Finally, we see that overall, for the whole token set, (t,d) deletion occurred 50% of the time, or in 2 out of 4 tokens.
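The counts in such a Results file can be reproduced by hand. As a sketch, here is the same tally in Python, on token strings reconstructed to match the percentages reported above (the strings themselves are made up; the code order is dependent variable, then following segment, then speaker):

```python
from collections import Counter

def distribution(tokens, group_index):
    """Tally variants of the dependent variable (character 0 of the code)
    within each variant of the factor group at group_index.
    Returns {factor: {variant: (count, percent)}}."""
    cells = {}
    for t in tokens:
        code = t.split()[0].lstrip("(")
        cells.setdefault(code[group_index], Counter())[code[0]] += 1
    return {f: {v: (n, round(100 * n / sum(c.values())))
                for v, n in c.items()}
            for f, c in cells.items()}

# Four made-up (t,d) tokens consistent with the figures quoted above:
toks = ["(YCP word1", "(YCN word2", "(NCN word3", "(NVN word4"]
```

`distribution(toks, 1)` gives 67% "Y" after a consonant and no "Y" after a vowel; `distribution(toks, 2)` gives 33% "Y" for Speaker N and 100% for Speaker P, matching the Results file described above.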

    Note: You can copy and paste this table into a Word document, such as a research paper. The columns will line up if you choose the Courier font. It will have strings of spaces rather than tabs, which can be a pain to edit, but you can fix it up as necessary. You can also edit it right in the Results window, which can be dangerous. But, if you do mess it up, you can also reconstruct a new results file by going back to the Cell menu and selecting Load cells to memory.


    Part 8: Report the overall distribution of the dependent variable in the data

    1. Provide an overall distribution of the dependent variable in the data, as per the following template. Format and label it as in this example:

       Table 1: Overall distribution of (r): Linguistic variables (6 speakers)

       Linguistic variables                                         % [r-1]   N [r-1]   Total # of tokens in the category
       Preceding vowel
         high front
         mid front
         ...
       Following context
         Following C, in the next morpheme but the same syllable
         Following V
         ...

    2. Create a second table, similar to Table 1, that provides a distributional analysis for the social variables.

    Note: the numbers in these table templates are made up! You will have to replace them with your own numbers and your own categories, depending on what you find in your data.

    Submit Tables 1 and 2 as well as a print-out of your Goldvarb .Res file (for Part 8). Use the same type of naming convention as for the previous assignment.

    Note: If you don't have a printer and GoldVarb on the same computer, you may need to COPY the contents of your Results file to a Word .doc or some such, save it, and take/send it to a computer with a printer. (Because you won't be able to easily open a .Res file on a computer where Goldvarb isn't installed.)

    So far, you have been doing univariate analysis – looking at only one independent variable at a time. In the next Part, you will learn how to conduct an analysis with more than one independent variable included. This is very important when your data set does not have a balanced distribution of every combination of every independent variable – that is, when you are dealing with real-world data.


    Part 9: Factor Weights

    Preparing for multivariate analysis

    For this part of the project, you want to conduct an analysis that examines the tokens from all speakers together. You need to use a token file that has the tokens for all six speakers in it.

    Use [r-1] as the application value. Be sure to properly label what is being counted, in your tables.

    In order to find out which variables are significant, you must create a results file with no "Knockouts," i.e., no "0" values. This may mean combining or deleting certain factors or factor groups. You do this by creating a new Conditions file.

    Make sure you have principled reasons for the changes you make. For example, it's OK to combine following stop and following fricative if there was 95% deletion for stops and 100% for fricatives, because (a) stops and fricatives are somewhat similar phonetically and (b) the numbers were fairly similar.

    Select Recode setup from the Tokens menu. First, copy over the dependent variable from the left to the right side, using the Copy button. Then, for any factor groups that you wish to leave intact, select them on the left side (clicking on their factor group number) and then click Copy.

    To exclude a certain factor, click on it on the left side, then click Exclude and say OK. Then copy the factor group. Although the excluded factor will still show up, tokens containing it will be ignored in the analysis.

    To recode a factor group (normally to combine two categories that were coded separately), select it, choose recode, and then type over the letters on the right to show how you want to recode them.

    Make sure it worked. Do this by going back to the Create cells step and creating the distribution tables again and making sure that there are no knock-outs. If it didn't work, try a new condition file. This may get tedious, so you might copy down your coding from the Conditions window. You should be able to figure out most of the Lisp code.
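What Recode setup writes into the Conditions file amounts to a character-for-character mapping over the token codes. A minimal sketch (the codes "t", "f", and "O" are hypothetical, standing for following stop, following fricative, and a combined obstruent category):

```python
def recode_group(code, group_index, mapping):
    """Apply a recode to one token code string: replace the character for
    the given factor group according to mapping; characters not listed in
    mapping are left intact."""
    chars = list(code)
    chars[group_index] = mapping.get(chars[group_index], chars[group_index])
    return "".join(chars)

# Hypothetical recode combining two near-categorical factors, as in the
# stop/fricative example above:
COMBINE_OBSTRUENTS = {"t": "O", "f": "O"}
```

Here `recode_group("Yt", 1, COMBINE_OBSTRUENTS)` and `recode_group("Yf", 1, COMBINE_OBSTRUENTS)` both yield "YO", collapsing the two categories into one and eliminating the knockout.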

    See more information in the "How to recode" document in Bb (or download here; you'll need the sample .tkn file, too).

    Multivariate analysis

    After producing the usual (distributional) results by selecting Load cells to Memory, choose Binomial, one-level from the Cells menu.

    This will create a table showing the Factor Weight of each factor, in addition to the percentages (App/Total). It looks pretty much like all the tables of weights and probabilities you've seen in various articles, but to get the number of tokens, you need to scroll back up to your distributional results. The weights are the values for p1, p2, etc., in the logistic equations we looked at, representing the effect that each factor has on whether the rule applies. (The factor weights are for the "Application value," which is whichever value of the dependent variable prints in the first column of the distribution table that you will have just made.)
    It will also show the frequency (percentage) for each factor, for that same Application value. (We will ignore the Input&Weight column.)

    This report also gives an Input value, which is the overall probability (po) of the application value occurring. That should always be reported.

    So for any one combination of factors (e.g., following consonant, Pat as speaker) we could calculate the probability of deletion by combining the po value with the appropriate p1, p2, etc., for each factor group. But we don't have to do it by hand, because Goldvarb does it for us -- those weights combined will equal the probability of a certain type of token undergoing the rule application.
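For the curious, the combination just described is the standard variable-rule formula: the input probability po and one weight per factor group are multiplied together on the odds scale. A sketch, with made-up weights:

```python
def combined_probability(p0, weights):
    """Variable-rule combination:
    p = (p0*p1*...*pn) / (p0*p1*...*pn + (1-p0)*(1-p1)*...*(1-pn)),
    where each pi is the weight of the factor present in the token."""
    num, den = p0, 1 - p0
    for w in weights:
        num *= w
        den *= 1 - w
    return num / (num + den)
```

A weight of 0.5 leaves the probability unchanged; weights above 0.5 favor the application value and weights below 0.5 disfavor it. For example, with input 0.5 and a single factor weight of 0.8, the combined probability is 0.8.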

    If you want to look at the constraints conditioning variation for only a subset of your tokens, for example, only tokens from younger speakers, you can use the Recode set-up to exclude tokens from the older speakers. Experiment with this.

    Choose Binomial, up & down for analysis of which factors are significant. (You don't need to do this step for the assignment. It's more relevant when you have more factors.)

    This analysis takes longer than the one-level, if you have a big token file.
    It spits out a lot of text and numbers, but indicates which factors are significant.
    You will notice that, generally, the factor groups it finds to be significant are those that have the biggest spread in values in the one-level analysis. If there is a lot of overlap between two different factor groups (e.g., if all tokens with a following vowel were produced by Pat), there may be differences.


    If you want to look for that type of overlap, or any interaction between 2 independent variables, choose Cross-tab from the Cells menu and select the 2 groups you are interested in. A new table will be created which shows you their distribution. You can save it as Text or Picture. The "Picture" one looks nicer when copied into another document, but can't be edited, and takes up more disk space. The "Text" one can be edited, and has to be, in order to be legible. Use Courier font to make the columns line up.
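The cross-tab itself is just a two-way count. A sketch, using the same made-up (t,d) token format as earlier (group 1 = following segment, group 2 = speaker):

```python
from collections import Counter

def cross_tab(tokens, g1, g2):
    """Count tokens for every combination of two factor groups."""
    counts = Counter()
    for t in tokens:
        code = t.split()[0].lstrip("(")
        counts[(code[g1], code[g2])] += 1
    return counts
```

A cell with zero tokens for some combination is exactly the kind of overlap that makes two factor groups hard to disentangle in the up & down analysis.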

    For Part 9, submit a document containing the Results file for a one-level binomial analysis for all speakers combined, as Table 3. Be sure to label all columns and rows clearly. It should be arranged like Factor Weight tables we've seen in a number of articles this semester.

    See the note above about printing issues.


    Part 10: Comparative analysis

    Submit a brief report (1-2 pages of prose) answering the following questions. Include the necessary tables from your own analysis, clearly labeled.

    Compare your results with the results presented in Nagy & Irwin (2010). What’s the same? What’s different?

    As you do this, think about things like:

    NOTE: The various files shown as examples are nonsense, un-related, and/or not created from real data.


    Updated March 26, 2012