Chapter 2 Quantitative data and analysis in Linguistics

Many linguistic questions can be tested using quantitative data: things that can be measured, counted, rated, or categorized.

How does the f0 (pitch) of a speaker’s voice differ between casual vs. formal speech?
How quickly do listeners process frequent vs. infrequent words?
How often is a morphological variant (e.g. -in vs. -ing) used by younger vs. older speakers?
How does perceived grammaticality differ for sentences with Object-Subject word order vs. Subject-Object word order?
What consonant (/b/ or /p/) is perceived by listeners when hearing sounds that differ in VOT?
Does speech rate in a second language vary based on the speaker’s amount of experience with the language?

Quantitative analysis is the process of trying to answer questions like these, or more generally, to interpret quantitative data, and includes:

Summarizing data: Describing distributions of data in meaningful terms
Discovering relationships: Examining patterns, such as differences between groups and associations between variables
Making inferences: Generalizing from a sample

Quantitative linguistic research: The process

TOPIC	What is your research question (general and specific)? What are your hypotheses?
METHODS	Choose an appropriate method (experiment? corpus?). Decide how you will evaluate your hypothesis.
ANALYSIS	Perform measurements. Present graphs of your data. Perform statistical analysis.
CONCLUSION	Prose summary of results. Was your hypothesis supported?

2.1 Types of data

How we choose to analyze data will differ depending on what type of data we are working with. One important distinction is between continuous and categorical data:

Continuous data: a measure that has meaningful numerical values. Examples:
- speakers’ f0 (measured in Hz)
- listeners’ reaction times (measured in ms)
- length of utterance (measured in number of words)
Categorical data: a measure that has values that belong to distinct categories. Examples:
- speaker’s first language (Hindi, Somali, Thai, Cree…)
- listeners’ accuracy on a perception task (correct vs. incorrect)
- syntactic construction (e.g. whether an utterance has SOV or VSO word order)

Usually, we will have a set of values, or a distribution of data, that we want to summarize, and the way to do this differs depending on the type of data. For example, to summarize continuous data from a group of 10 speakers, such as f0 (pitch), we can measure the f0 of each speaker in Hz, get a set of numbers like [240, 260, 130, 200, 300, 300, 321, 122, 132, 178], and take the average of these 10 numbers. On the other hand, if we want to summarize categorical data, such as the first language of the speakers ["Hindi", "Somali", "Tagalog", "Cree", "English", "Mandarin", "English", "Mandarin", "Hindi", "Hindi"], we can’t do the same thing, because it’s not possible to take an average of a set of non-numbers! Instead, we have to find a different way to summarize the data succinctly, for example by reporting the proportion of speakers who have each L1.
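The two kinds of summary described above can be sketched in a few lines of Python. This is only an illustration using the example values from the text; the variable names (f0_hz, first_languages, and so on) are our own choices, and Python is used here as an assumption rather than as the required tool for this kind of analysis.

```python
from collections import Counter
from statistics import mean

# Continuous data: f0 of 10 speakers in Hz (values from the text above)
f0_hz = [240, 260, 130, 200, 300, 300, 321, 122, 132, 178]

# Categorical data: first language of the same 10 speakers (from the text above)
first_languages = ["Hindi", "Somali", "Tagalog", "Cree", "English",
                   "Mandarin", "English", "Mandarin", "Hindi", "Hindi"]

# Continuous data can be summarized with an average
mean_f0 = mean(f0_hz)

# Categorical data cannot be averaged, so we count each category
# and convert the counts to proportions of the whole sample
counts = Counter(first_languages)
proportions = {lang: n / len(first_languages) for lang, n in counts.items()}

print(f"Mean f0: {mean_f0} Hz")
for lang, prop in proportions.items():
    print(f"{lang}: {prop:.0%}")
```

With these values, the mean f0 is 218.3 Hz, and the proportions show that 30% of the speakers have Hindi as their L1, 20% each have English and Mandarin, and 10% each have Somali, Tagalog, and Cree.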

2.2 Variables in research design

The sample questions presented at the beginning of this chapter all had to do with things that can take on different values, whether it is a continuous value (e.g., the rate of speech, measured in words per second) or a category (e.g., word order, which could be SOV vs. OVS vs. VSO etc.). We can call these “variables” (just like x in your high school algebra class).

In linguistic research, we often want to examine how the distribution of one variable varies in different circumstances. We can think of this as a relationship between two variables:

Outcome variable: what we’re measuring (also known as dependent variable or response variable). This is the thing we expect to depend on, or respond to changes in, the predictor variable.
Predictor variable: the dimension along which we expect the outcome variable to differ (also known as independent variable). This is the thing we expect to predict differences in the outcome variable.

For categorical variables, we will also often want to specify the levels, or potential categories.

As an example, take the first question at the beginning of this chapter: “How does the f0 (pitch) of a speaker’s voice differ between casual vs. formal speech?” Here we are interested in whether f0, a continuous variable, differs across two styles of speech. The outcome variable is f0, and the predictor variable is speech style, a categorical variable with two levels: casual and formal.
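To make the roles of the two variables concrete, here is a minimal Python sketch that groups a continuous outcome (f0) by the levels of a categorical predictor (speech style). The measurements are invented purely for illustration and are not real data; the names observations, f0_by_style, and mean_f0_by_style are likewise hypothetical.

```python
from collections import defaultdict
from statistics import mean

# HYPOTHETICAL observations, invented for illustration only.
# Each record pairs a predictor value (speech style) with an outcome value (f0 in Hz).
observations = [
    ("casual", 190), ("casual", 205), ("casual", 198),
    ("formal", 215), ("formal", 222), ("formal", 208),
]

# Group the outcome values by the level of the predictor
f0_by_style = defaultdict(list)
for style, f0_hz in observations:
    f0_by_style[style].append(f0_hz)

# The levels of the categorical predictor are the distinct categories observed
levels = sorted(f0_by_style)

# Summarize the outcome separately for each level of the predictor
mean_f0_by_style = {style: mean(values) for style, values in f0_by_style.items()}

print("Levels of predictor:", levels)
for style in levels:
    print(f"Mean f0 in {style} speech: {mean_f0_by_style[style]:.1f} Hz")
```

Comparing the per-level summaries is only a first descriptive step. Deciding whether a difference between levels generalizes beyond the sample is a question of inference.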

Consider the other sample questions at the beginning of the chapter. For each, what is the outcome variable and what is the predictor variable (and for those variables that are categorical, what are the levels)?

2.3 Formulating research questions

While much of this class will be focused on trying to answer research questions, actually coming up with research questions is an important and deceptively difficult task.

Properties of good research questions

Target underlying curiosities: What are we interested in knowing?
Contextualized within previous work: What have other people asked and found about this or related questions?
Embedded in a theory about how the world works (in our case, usually a theory about language).
Contribute to a larger research endeavour: small questions add up.
Testable: There is a concrete possible answer space.

We almost never answer research questions directly; instead, we provide evidence for answers with data.

Data can be corpus data, articulatory or acoustic measures, results of perception or production experiments, or language users’ judgments.
We are usually looking at how to explain the patterning of some outcome variable, usually by relating it to another variable (outcome and predictor variables, above).

2.3.1 General vs. specific research questions

Usually, questions of theoretical interest are quite general, and too big to answer with a single experiment. Note that some of the examples of “general” questions below might sound pretty specific. However, if you think about it, there are a lot of different ways you could go about answering each question.

It’s important to always have a general question in mind, but when doing research, it is critical to additionally formulate a specific, testable research question that describes exactly what you will be doing to test the more general question. One way to make sure the question is specific is to formulate it in terms of the specific outcome and predictor variables you will be using.

The table below shows a couple of examples of a continuum of generality in questions. Note that only the specific question names the specific outcome and predictor variables.

Very general question	General-ish question	Specific question
Is there interaction between the sound systems of bilinguals?	Do French-English bilinguals produce English stops differently than English monolinguals?	Does the VOT of English voiceless stops differ systematically between bilingual French-English and monolingual English speakers?
Do listeners show perceptual accommodation to dialectal variation?	Do listeners show increased acceptability for a syntactic construction after spending time in a dialect region where it is used?	Do grammaticality ratings for the “I’m done my homework” construction common in Canadian English differ for international students who have just arrived in Canada vs. those who have lived there for 4 years?