Transcription conventions
Transcription of the spoken data
The transcripts are orthographic and record only basic properties of spoken language, such as pauses, false starts, and unintelligible words. Pauses, for example, are marked, but their length is not. The aim is for the transcription to be as readable as possible for a wide range of users.
In this version of COREFL (version 1), transcriptions are provided only when the spoken texts are in English (i.e., English native subcorpus, L2 English learner corpora) or in Spanish (i.e., Spanish native subcorpus).
Table: Transcription conventions
Phenomenon | Code | Comment | Examples |
---|---|---|---|
Empty pauses | / | Only for very obvious pauses with a clear flat line in the waveform, independently of their length. | A pause may coincide with a clause boundary (i.e., the end of a clause) but often it does not. (1) but the deer is none too thrilled that / the boy is on top of him (2) One day there was a boy who really loved animals / he had a bird |
Filled pauses | uh (English), eh (Spanish) | The sound produced in the filled pause may be of different kinds, including uh, eh, er, em, erm, etc. | the boy looks into the tree uh doesn't see the frog there |
Non-linguistic sounds | hhh | Unspecified non-linguistic occurrence, which can include: laughing, coughing, clearing one’s throat, sighing, deep breathing. | he climbed hhh a mountain |
Incomprehensible or unintelligible word(s) | xxx | Used by the transcriber to mark an unintelligible word or passage. | and now it's not just a frog but it's xxx entire family |
False starts and Cut-off words | = | The symbol marks a false start or a cut-off word and is inserted immediately after the unfinished word. | he sca= scared |
Repetitions | | They are not tagged or marked in any way. The transcription simply reflects what the speaker says. A repetition may involve a single word or several words. | (1) also he gets angry and decided to to make (2) they could see they could see |
Rewordings or Reformulations | | They are not tagged or marked in any way. The transcription simply reflects what the speaker says. | (1) was contemplating a the day's capture (2) so they go into the wall into the forest |
Capitalization | | Capital letters are used for proper names and for acronyms. | Charles Chaplin, London, USA |
Sound lengthening | | Lengthened phonemes are not transcribed or annotated in any way. | |
Intonation and punctuation | | The transcription does not include any of the standard punctuation used in written language, such as a full stop (.) to mark the boundary between sentences, a comma (,) to indicate a pause, or a question mark (?) to indicate rising or falling intonation. | |
Foreign word(s) and Codeswitches | | They are transcribed as such. | |
Contractions | ’ | English: contractions are transcribed as such. Spanish: no contractions are used. | |
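
The codes in the table can be recognised mechanically when a transcript is processed. The following is a minimal Python sketch, not part of COREFL itself: the function names `label_tokens` and `count_labels` and the label strings are hypothetical, and the sketch assumes that every code appears as a whitespace-separated token, as in the examples above.

```python
from collections import Counter

# Codes taken from the convention table; the label names are illustrative only.
FIXED_CODES = {
    "/": "empty_pause",
    "hhh": "non_linguistic",
    "xxx": "unintelligible",
}
# Filled-pause code per transcript language, as listed in the table.
FILLED_PAUSES = {"english": "uh", "spanish": "eh"}


def label_tokens(line, language="english"):
    """Return (token, label) pairs for one whitespace-separated transcript line."""
    filled = FILLED_PAUSES[language]
    labelled = []
    for token in line.split():
        if token in FIXED_CODES:
            label = FIXED_CODES[token]
        elif token == filled:
            label = "filled_pause"
        elif len(token) > 1 and token.endswith("="):
            # "=" directly follows an unfinished word (false start or cut-off).
            label = "cut_off"
        else:
            label = "word"
        labelled.append((token, label))
    return labelled


def count_labels(line, language="english"):
    """Count how often each label occurs in one transcript line."""
    return Counter(label for _, label in label_tokens(line, language))


if __name__ == "__main__":
    print(label_tokens("he sca= scared"))
    print(count_labels("the boy looks into the tree uh doesn't see the frog there"))
```

Repetitions and rewordings carry no code in the transcripts, so a sketch like this labels them as ordinary words; detecting them would need a separate heuristic.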