Naomi Nagy

Linguistics at U of T

Extracting Tokens from Transcribed Interviews (using Word)

These directions use the example of extracting tokens of subordinate and relative clauses from interviews transcribed in MS Word. (See tips on transcribing and using Express Scribe.)

An alternative is to transcribe, extract and code using ELAN.

If you are working in Word (or some other word processor):

One method is to mark all tokens to be extracted in one color, say purple.

Save the document under a new name (e.g., Karl_EN_subord.doc) so you always have the full transcript to go back to.

Make sure the format is 12-point Times New Roman with 1 1/2 line spacing.

Change the color of every sentence with a subordinate or relative clause, starting at the top of page 5 (we think speakers might speak rather unnaturally at the very beginning of the interview).

Starting FROM THE END, delete all text after the last colored sentence.

Right before the last colored sentence, type a return, then the page # that sentence is on, then a tab.

Then delete all material between the last colored sentence and the 2nd to last, and repeat the page # process.

(You have to start from the end with this, otherwise the page #'s will change as you delete material.)
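
The reason for working backwards can be illustrated with a toy model. The sketch below is only an assumption-laden simplification: it pretends a page holds a fixed number of lines (the hypothetical `lines_per_page`), whereas Word paginates by layout. It is not part of the Word procedure itself; it just shows that deleting from the end leaves the positions of earlier material untouched, while deleting from the start shifts everything that follows.

```python
def extract_marked(lines, lines_per_page, from_end=True):
    """Keep only marked lines, labelling each with its page number.

    lines: list of (text, is_marked) pairs, in document order.
    The page number is read from the line's CURRENT position in the
    shrinking document, as it would be when reading it off the screen.
    """
    doc = list(lines)
    kept = []
    if from_end:
        # Deleting at position i never moves anything before i.
        for i in range(len(doc) - 1, -1, -1):
            text, marked = doc[i]
            if marked:
                kept.append((i // lines_per_page + 1, text))
            else:
                del doc[i]
        kept.reverse()
    else:
        # Deleting at position i pulls every later line forward.
        i = 0
        while i < len(doc):
            text, marked = doc[i]
            if marked:
                kept.append((i // lines_per_page + 1, text))
                i += 1
            else:
                del doc[i]
    return kept


transcript = [("a", False), ("b", False), ("c", True), ("d", False), ("e", True)]
print(extract_marked(transcript, 2, from_end=True))   # [(2, 'c'), (3, 'e')]
print(extract_marked(transcript, 2, from_end=False))  # [(1, 'c'), (1, 'e')]
```

Working from the end gives the original page numbers (2 and 3); working from the start reports page 1 for both sentences, because the deleted material no longer pushes them down the document.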

Leave in all colored sentences, and all indications such as "end of tape 1" or indicators of time units (if there are any). Everything else (black text) gets deleted.

Then save the document and open a new blank document (e.g., Karl_EN_rel.doc).

Then cut all the relative clauses from one document and paste them into the other, leaving the subord. clauses behind. Copy all helpful clues like "end of tape 1" or time units into both.

You could then use square brackets to mark each subord/rel clause. (This would be necessary for automatic extraction using GoldSearch.)
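
As a rough idea of what bracket-based extraction can look like, here is a minimal Python sketch. It is not GoldSearch and makes no claim about how GoldSearch works; the function name `bracketed_clauses` is hypothetical, and the pattern assumes brackets are never nested inside one another.

```python
import re

# Matches the text between one "[" and the next "]" (no nesting assumed).
BRACKETED = re.compile(r"\[([^\[\]]*)\]")


def bracketed_clauses(sentence):
    """Return the text of every square-bracketed clause in a sentence."""
    return BRACKETED.findall(sentence)


print(bracketed_clauses("I think [that this is cool]"))  # ['that this is cool']
print(bracketed_clauses("No clause marked here"))        # []
```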

Now Karl_EN_subord.doc looks something like this:


(Start of Tape 1)

3 I think [that this is cool]

3 My mom said [that this is cool]

3 My dad said [it's not]

4 I know [that you are talking about something confusing]


Karl_EN_rel.doc looks something like this:

2 That's the book [I like]

3 He's the singer [I saw]


There should be a tab after each page mark and a return after the end of each sentence. If a sentence is longer than 1 line, that's ok. It will all format itself nicely when copied into Excel.
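
If you would rather build the spreadsheet by script than by copy and paste, the tab-and-return layout is easy to read programmatically. The following is a sketch only, assuming the extracted document has been saved as plain text; the function names `parse_tokens` and `to_csv` are hypothetical. Lines without a leading page number (such as tape markers) are kept with an empty page cell.

```python
import csv
import io


def parse_tokens(text):
    """Split 'page<TAB>sentence' lines into (page, sentence) rows."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between tokens
        page, sep, sentence = line.partition("\t")
        if sep and page.strip().isdigit():
            rows.append((int(page), sentence.strip()))
        else:
            # No page mark: e.g. "(Start of Tape 1)" or a time indicator.
            rows.append((None, line.strip()))
    return rows


def to_csv(rows):
    """Render rows as CSV text that Excel can open."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["page", "sentence"])
    for page, sentence in rows:
        writer.writerow(["" if page is None else page, sentence])
    return buf.getvalue()


sample = "(Start of Tape 1)\n\n3\tI think [that this is cool]\n\n3\tMy dad said [it's not]\n"
print(to_csv(parse_tokens(sample)))
```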

If a sentence has 2 clauses in it, copy it in twice. Imagine that Karl said:

"I think that you said that this is the man I saw."

This will become:

in the file Karl_EN_rel.doc:

"I think that you said that this is the man [I saw]."

in the file Karl_EN_subord.doc:

"I think [that you said that this is the man I saw.]"

"I think that you said [that this is the man I saw.]"

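The one-copy-per-clause rule can be sketched in Python as follows. This is an illustration under stated assumptions, not part of the Word procedure: the function name `bracket_each` is hypothetical, the clauses are supplied by hand, and each clause is bracketed at its first occurrence in the sentence.

```python
def bracket_each(sentence, clauses):
    """Return one copy of the sentence per clause, with that clause bracketed."""
    copies = []
    for clause in clauses:
        start = sentence.find(clause)  # first occurrence only
        if start == -1:
            raise ValueError(f"clause not found in sentence: {clause!r}")
        end = start + len(clause)
        copies.append(sentence[:start] + "[" + clause + "]" + sentence[end:])
    return copies


sentence = "I think that you said that this is the man I saw."

# Relative clause: one copy for Karl_EN_rel.doc
for copy in bracket_each(sentence, ["I saw"]):
    print(copy)

# Subordinate clauses: two copies for Karl_EN_subord.doc
for copy in bracket_each(sentence, [
    "that you said that this is the man I saw.",
    "that this is the man I saw.",
]):
    print(copy)
```
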
email: naomi dot nagy at utoronto dot ca