COREFL: Corpus of English as a Foreign Language


Oct. 2021

Transcription conventions

Transcription of the spoken data

The transcripts are orthographic and record only basic properties of spoken language. For example, pauses are marked but their length is not. The spoken-language features marked include pauses, false starts, and incomprehensible words, among others. The aim is for the transcription to be as legible as possible to a wide range of users.

In this version of COREFL (version 1), transcriptions are provided only when the spoken texts are in English (i.e., English native subcorpus, L2 English learner corpora) or in Spanish (i.e., Spanish native subcorpus).

Table: Transcription conventions

| Phenomenon | Code | Comment | Examples |
|---|---|---|---|
| Empty pauses | / | Only for very obvious pauses with a clear flat line in the waveform, independently of their length. A pause may coincide with a clause boundary (i.e., the end of a clause) but often it does not. | (1) but the deer is none too thrilled that / the boy is on top of him (2) One day there was a boy who really loved animals / he had a bird |
| Filled pauses | uh (English), eh (Spanish) | The sound produced in the filled pause may be of different kinds, including uh, eh, er, em, erm, etc. | the boy looks into the tree uh doesn't see the frog there |
| Non-linguistic sounds | hhh | Unspecified non-linguistic occurrence, which can include laughing, coughing, clearing one's throat, sighing, and deep breathing. | he climbed hhh a mountain |
| Incomprehensible or unintelligible word(s) | xxx | Used by the transcriber to mark an unintelligible word or passage. | and now it's not just a frog but it's xxx entire family |
| False starts and cut-off words | = | The symbol marks a false start or a cut-off word and is inserted immediately after the unfinished word. | he sca= scared |
| Repetitions | | They are not tagged or marked in any way. The transcription simply reflects what the speaker says. A repetition may involve a single word or a sequence of words. | (1) also he gets angry and decided to to make (2) they could see they could see |
| Rewordings or reformulations | | They are not tagged or marked in any way. The transcription simply reflects what the speaker says. | (1) was contemplating a the day's capture (2) so they go into the wall into the forest |
| Capitalization | | Capital letters are used for proper names and for acronyms. | Charles Chaplin, London, USA |
| Sound lengthening | | Lengthened phonemes are not transcribed or annotated in any way. | |
| Intonation and punctuation | | The transcription does not include any of the standard punctuation used in written language, such as a full stop (.) to mark the boundary between sentences, a comma (,) to indicate a pause, or a question mark (?) to indicate rising/falling intonation. | |
| Foreign word(s) and codeswitches | | They are transcribed as such. | |
| Contractions | | English: contractions are transcribed as such. Spanish: no contractions used. | |
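
As an illustration of how the codes in the table might be processed automatically, the following Python sketch labels each whitespace-separated token of a transcript line. It is not part of COREFL: the function names (classify, annotate, plain_words) and the category labels are hypothetical, and the sketch assumes that every code appears as its own token, with = attached to the end of the unfinished word as in the example above.

```python
# Codes taken from the conventions table; category labels are our own (hypothetical).
FILLED_PAUSE = {"en": "uh", "es": "eh"}  # one filled-pause code per transcript language


def classify(token, language="en"):
    """Return a category label for a single whitespace-separated token."""
    if token == "/":
        return "empty_pause"
    if token == FILLED_PAUSE.get(language):
        return "filled_pause"
    if token == "hhh":
        return "non_linguistic"
    if token == "xxx":
        return "unintelligible"
    # A false start or cut-off word carries "=" right after the unfinished word.
    if len(token) > 1 and token.endswith("="):
        return "cut_off"
    return "word"


def annotate(line, language="en"):
    """Pair every token of a transcript line with its category label."""
    return [(tok, classify(tok, language)) for tok in line.split()]


def plain_words(line, language="en"):
    """Keep only ordinary words, dropping pauses, noises, and cut-off words."""
    return [tok for tok, kind in annotate(line, language) if kind == "word"]


if __name__ == "__main__":
    for tok, kind in annotate("he sca= scared"):
        print(tok, kind)
```

Because repetitions and rewordings are not marked in the transcripts, a filter like plain_words leaves them in place (for example, both occurrences of "to" in "decided to to make"); detecting them would need a separate heuristic.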
