Naomi Nagy

Linguistics at U of T



created by David Boas, Miriam Meyerhoff,
and Naomi Nagy

General description

Much data-driven linguistic research relies on coordinating data of two types:
  • a linguistic corpus (a collection of speech or writing from a number of sources or speakers) that has been tagged, or marked up, to allow researchers to identify linguistic features of interest to them, and
  • a record of the characteristics of the speakers or writers contributing to the corpus (sometimes also including the context of the recording).
To discover the patterns of linguistic variation and language use in the corpus, it is necessary to examine how the language varies according to the different individual, social, and linguistic conditions also encoded in the corpus. For this purpose, we compare the frequencies of variants of a dependent linguistic variable across the (putatively) independent variables: speaker, context, and linguistic environments.
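As a rough illustration of that comparison, the sketch below computes the relative frequency of each variant within each level of one independent variable. The token data, the variant labels, and the function name `variant_rates` are all hypothetical and are not part of Goldsearch.

```python
from collections import Counter, defaultdict

# Hypothetical tokens: (variant of the dependent variable, speaker age group).
tokens = [
    ("deleted", "young"), ("retained", "young"), ("deleted", "young"),
    ("retained", "old"), ("retained", "old"), ("deleted", "old"),
]

def variant_rates(tokens):
    """Return {group: {variant: proportion}} for one independent variable."""
    counts = defaultdict(Counter)
    for variant, group in tokens:
        counts[group][variant] += 1
    return {
        group: {v: n / sum(c.values()) for v, n in c.items()}
        for group, c in counts.items()
    }

rates = variant_rates(tokens)
```

Comparing `rates["young"]` with `rates["old"]` shows how the share of each variant differs between the two groups.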

A common method for this purpose has been to extract each occurrence (token) of the linguistic variable from the corpus one by one (or speaker by speaker), and then list the codes associated with each speaker in a separate file. This file can then be analyzed by the programs Varbrul or Goldvarb X (created by David Sankoff et al. and David Rand, and updated by David Sankoff, Sali Tagliamonte and Eric Smith). These programs tally the number of occurrences of each combination of factors (cells). Varbrul then allows univariate and multivariate analyses of the interactions of factors coded in the data.
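The tallying step can be pictured with a minimal sketch. It assumes each token is coded as a short string with one character per factor group, the first character being the variant of the dependent variable. The codes and the function name `tally_cells` are hypothetical and do not reproduce Varbrul's own implementation.

```python
from collections import Counter

# Hypothetical token strings: one character per factor group
# (variant, sex, age), e.g. "dfy" = deleted, female, young.
token_codes = ["dfy", "rfy", "dfy", "rmo", "dmo", "rmo"]

def tally_cells(codes):
    """Count tokens per cell.

    A cell is one combination of independent factors (all characters after
    the first); within each cell, counts are split by variant (first character).
    """
    cells = {}
    for code in codes:
        variant, factors = code[0], code[1:]
        cells.setdefault(factors, Counter())[variant] += 1
    return cells

cells = tally_cells(token_codes)
```

Here `cells["fy"]` holds the variant counts for young female speakers and `cells["mo"]` those for older male speakers.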

Goldsearch automates the first (and most time-consuming) step by creating lists of all occurrences of tokens illustrating each variant of the dependent variable. It does this by treating the corpus file and the file containing information about each contributor to the corpus as linked databases. The feature unique to Goldsearch, which commercial database programs do not seem to offer, is the ability to conduct iterative searches in one of the files while maintaining an active link with the other. This means that when you conduct a search in one file, every token that the program finds is cross-referenced to the information in the other file. This information is used to create two new files. One is a list of all tokens in the corpus that match the search string. The other is a cumulative list of the independent factors associated with each of those tokens; this second output file is a text file ready to be analyzed by Varbrul. In addition, a raw count of the tokens found for each contributor to the corpus is shown at the end of each search run.
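The linked search described above might be sketched as follows. This is only an illustration under stated assumptions, not Goldsearch's actual code: the `<NEG>` tag format, the speaker IDs, the factor codes, and the function name `linked_search` are invented for the example, and each factor string is assumed to begin with an opening parenthesis as in a Goldvarb-style token file.

```python
import re
from collections import Counter

# Hypothetical speaker file: speaker ID -> string of independent factor codes.
speakers = {"S1": "fy", "S2": "mo"}

# Hypothetical tagged corpus: (speaker ID, line of transcript).
corpus = [
    ("S1", "I <NEG>don't</NEG> know"),
    ("S2", "he <NEG>never</NEG> said that"),
    ("S1", "we went home"),
    ("S1", "she <NEG>didn't</NEG> go"),
]

def linked_search(pattern, corpus, speakers):
    """Search the corpus and look up each match's speaker in the speaker file.

    Returns the list of matching tokens, the parallel list of factor strings,
    and a raw count of tokens per speaker.
    """
    regex = re.compile(pattern)
    token_list, factor_list, per_speaker = [], [], Counter()
    for speaker_id, line in corpus:
        for match in regex.finditer(line):
            token_list.append((speaker_id, match.group(0)))
            # Assumed token-file convention: "(" followed by factor codes.
            factor_list.append("(" + speakers[speaker_id])
            per_speaker[speaker_id] += 1
    return token_list, factor_list, per_speaker

tokens, factors, counts = linked_search(r"<NEG>.*?</NEG>", corpus, speakers)
```

The three return values correspond to the two output files and the per-contributor count mentioned in the text; writing `factors` to a text file, one string per line, would give a file in the assumed token format.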

To summarize,

  • This application allows you to perform a search of a (bracketed and tagged) corpus.
  • It records the occurrences of a certain type of token (those which match the search string) for each speaker and setting.
  • It counts and codes the number of occurrences of each type of token (or search string) for each speaker.
  • It outputs the matches as a list of strings of independent factors associated with each speaker or turn, ready to use as a token file for Goldvarb.

Features of GOLDSEARCH

Getting started

Defining the search

Document requirements

System requirements


Input and output

How to get Goldsearch from the WWW

For further information, or to suggest improvements or report problems, send e-mail to naomi dot nagy at utoronto dot ca

To learn more about the creators of this application, look at the home pages of:
