COREFL: Corpus of English as a Foreign Language


Oct. 2021

Transcription conventions

Transcription of the spoken data

The transcripts are orthographic and record only basic properties of spoken language, such as pauses, false starts, and incomprehensible words. Pauses, for example, are marked, but their length is not. The aim is for the transcriptions to be as legible as possible to a wide range of users.

In this version of COREFL (version 1), transcriptions are provided only when the spoken texts are in English (i.e., English native subcorpus, L2 English learner corpora) or in Spanish (i.e., Spanish native subcorpus).

Table: Transcription conventions

| Phenomenon | Code | Comment | Examples |
|---|---|---|---|
| Empty pauses | / | Only for very obvious pauses with a clear flat line in the waveform, independently of their length. A pause may coincide with a clause boundary (i.e., the end of a clause) but often it does not. | (1) but the deer is none too thrilled that / the boy is on top of him (2) One day there was a boy who really loved animals / he had a bird |
| Filled pauses | uh (English), eh (Spanish) | The sound produced in the filled pause may be of different kinds, including uh, eh, er, em, erm, etc. | the boy looks into the tree uh doesn't see the frog there |
| Non-linguistic sounds | hhh | Unspecified non-linguistic occurrence, which can include: laughing, coughing, clearing one’s throat, sighing, deep breathing. | he climbed hhh a mountain |
| Incomprehensible or unintelligible word(s) | xxx | Used to mark an unintelligible word or passage by the transcriber | and now it's not just a frog but it's xxx entire family |
| False starts and Cut-off words | = | The symbol marks a false start or a cut-off word and is inserted immediately after the unfinished word. | he sca= scared |
| Repetitions | | They are not tagged or marked in any way. The transcription simply reflects what the speaker says. Repetitions can be repeated words or multiple words. | (1) also he gets angry and decided to to make (2) they could see they could see |
| Rewordings or Reformulations | | They are not tagged or marked in any way. The transcription simply reflects what the speaker says. | (1) was contemplating a the day's capture (2) so they go into the wall into the forest |
| Capitalization | | Capital letters are used for proper names and for acronyms. | Charles Chaplin, London, USA |
| Sound lengthening | | Lengthened phonemes are not transcribed or annotated in any way. | |
| Intonation and punctuation | | The transcription does not include any of the standard punctuation used in written language, like a full stop (.) to mark the boundary between sentences, or a comma (,) to indicate a pause or a question mark (?) to indicate a rising/falling intonation. | |
| Foreign word(s) and Codeswitches | | They are transcribed as such. | |
| Contractions | | English: contractions are transcribed as such. Spanish: no contractions used. | |
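
The marker codes listed in the table lend themselves to simple automatic processing. The following is a minimal Python sketch, not part of COREFL itself, of how a user might label tokens in a transcript line and strip the markers to obtain plain text. It assumes that tokens are separated by whitespace and that each code appears as its own token (with = attached to the end of a cut-off word). The function names, label names, and the language parameter are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Codes taken from the convention table; everything else here is illustrative.
EMPTY_PAUSE = "/"
FILLED_PAUSE = {"english": "uh", "spanish": "eh"}
NON_LINGUISTIC = "hhh"
UNINTELLIGIBLE = "xxx"
CUT_OFF_SUFFIX = "="


def classify_token(token, language="english"):
    """Return a (hypothetical) label for one whitespace-separated token."""
    if token == EMPTY_PAUSE:
        return "empty_pause"
    if token == FILLED_PAUSE[language]:
        return "filled_pause"
    if token == NON_LINGUISTIC:
        return "non_linguistic"
    if token == UNINTELLIGIBLE:
        return "unintelligible"
    if token.endswith(CUT_OFF_SUFFIX):
        # "=" is inserted immediately after the unfinished word, e.g. "sca="
        return "cut_off"
    return "word"


def annotate(line, language="english"):
    """Pair every token in a transcript line with its label."""
    return [(tok, classify_token(tok, language)) for tok in line.split()]


def count_markers(line, language="english"):
    """Count how often each non-word marker occurs in a line."""
    return Counter(label for _, label in annotate(line, language) if label != "word")


def clean_text(line, language="english"):
    """Drop all marker tokens and keep only the ordinary words."""
    return " ".join(tok for tok, label in annotate(line, language) if label == "word")


if __name__ == "__main__":
    print(annotate("he sca= scared"))
    print(clean_text("he climbed hhh a mountain"))  # he climbed a mountain
```

Because repetitions and rewordings are not marked in the transcripts, a sketch like this leaves them untouched; detecting them would need a separate heuristic.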