COREFL: Corpus of English as a Foreign Language


Oct. 2021


COREFL follows the same design principles as those of the CEDEL2 corpus.

In COREFL we investigate how people learn English. To this end, we collected a large database (a corpus) of written (and some spoken) texts produced by learners of English. This kind of database is called a ‘learner corpus’ or ‘L2 corpus’.

The corpus is intended to be useful for linguists, researchers and teachers/learners of English, as well as those interested in other uses of learner corpora (computational linguists, course material designers, etc.).

Corpus description and subcorpora

COREFL (version 1) is a large corpus that contains samples of the language produced by learners of English as a second/foreign language. For comparative purposes, it also contains a native control subcorpus of the language produced by native speakers of English from different varieties (American English and British English mainly, with a few samples of other varieties), so it can be used as a native corpus in its own right. It contains an additional native control subcorpus: native speakers of Spanish, which is the mother tongue of a large part of the L2 learners. Therefore, at this stage of COREFL (version 1), we have the following subcorpora:

Table: Learner and native control subcorpora in COREFL v.1

Native control subcorpus 1 (learner's mother tongue) | Learner subcorpus | Native control subcorpus 2 (learner's target language)
L1 Spanish | L1 Spanish-L2 English | L1 English
L1 German [under development] | L1 German-L2 English | L1 English

Spoken texts currently amount to about a third of the entire corpus (in words). An important feature of the spoken texts in COREFL is that each one can be matched to a written text produced by the same participant, who did the same task twice: the written text was produced first and then, after at least 15 days (so as to avoid task-repetition effects), the spoken text was produced. In this way, researchers can investigate the effects of medium (spoken vs. written) while holding the learner and the task constant.
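The matching of spoken and written texts described above can be sketched in a few lines of Python. This is only an illustration: the record fields (`participant`, `task`, `medium`, `date`) and the sample values are hypothetical and do not reflect the actual COREFL metadata or file format.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: field names and values are illustrative assumptions,
# not the actual COREFL metadata.
texts = [
    {"participant": "P001", "task": "Frog", "medium": "written", "date": date(2020, 3, 2)},
    {"participant": "P001", "task": "Frog", "medium": "spoken", "date": date(2020, 3, 20)},
    {"participant": "P002", "task": "Chaplin", "medium": "written", "date": date(2020, 3, 5)},
]


def pair_by_medium(records):
    """Return (written, spoken) pairs produced by the same participant on the same task."""
    grouped = defaultdict(dict)
    for rec in records:
        grouped[(rec["participant"], rec["task"])][rec["medium"]] = rec
    return [
        (by_medium["written"], by_medium["spoken"])
        for by_medium in grouped.values()
        if "written" in by_medium and "spoken" in by_medium
    ]


pairs = pair_by_medium(texts)
for written, spoken in pairs:
    # Days elapsed between the written and the spoken version of the task.
    gap_days = (spoken["date"] - written["date"]).days
    print(written["participant"], written["task"], gap_days)
```

Participants who produced only one of the two versions (P002 in this sample) are simply left out of the paired set.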

The L1 German native control subcorpus is currently under development. In future versions of COREFL, we will expand the corpus by adding further L2 corpora and their corresponding native control subcorpora in such a way that there is a control subcorpus type 1 for every learner subcorpus.

Corpus design: Tasks

The four tasks used in COREFL are a selection of the tasks originally used in the CEDEL2 corpus. In particular, COREFL uses only tasks number 2 (famous person), number 3 (film), number 13 (frog) and number 14 (Chaplin), which are marked with an asterisk (*) in the table below. Importantly, tasks are not associated with any particular proficiency level, i.e., learners of all levels have participated in most/all tasks.

Table: COREFL tasks (only those marked with * have been used for the current version of COREFL)

Task number | Task title | Task description
1 | Region where you live | What is the region where you live like? / ¿Cómo es la región donde vives?
2* | Famous person | Talk about a famous person. / Habla de una persona famosa.
3* | Film | Summarise a film you have seen recently. / Resume una película que has visto recientemente.
4 | Last year's holidays | What did you do during your holidays last summer? / ¿Qué hiciste el año pasado durante las vacaciones?
5 | Future plans | What are your plans for the future? / ¿Cuáles son tus planes para el futuro?
6 | Recent trip | Describe a trip you have recently made. / Describe un viaje que has hecho recientemente.
7 | Experience | Talk about an experience you have recently had. / Cuenta una experiencia que hayas vivido.
8 | Terrorism | Talk about the problem of terrorism in the world. / Habla del problema del terrorismo en el mundo.
9 | Anti-smoking law | What do you think about the new anti-smoking law? / ¿Qué opinas de la nueva ley anti-tabaco?
10 | Gay couples | Do you think gay couples should have the right to get married and adopt children? / ¿Crees que las parejas gay tienen el derecho de casarse y adoptar niños?
11 | Marijuana legalization | Do you think marijuana should be legal? / ¿Crees que la marihuana se debería legalizar?
12 | Immigration | Analyse the main aspects concerning immigration. / Analiza los principales aspectos de la inmigración.
13* | Frog | Tell the story shown in the pictures. You can add new aspects to the story or ignore some aspects in the pictures. Your text should start “Un día / One day...”
14* | Chaplin | Watch the following Chaplin video clip (4 minutes). Summarise the story. You can watch the video clip more than once.

Corpus design: Variables

As done in the CEDEL2 corpus, COREFL was designed with a second language acquisition (SLA) agenda in mind. For every participant, we collected a large number of variables that are essential for SLA researchers. There are two sets of variables: linguistic background variables and task variables.

Table: Learner’s variables (linguistic background and task)

Linguistic background variables
  1. L1 of the learner
  2. L1 of the learner’s father
  3. L1 of the learner’s mother
  4. Language(s) spoken at home
  5. Placement test score (0-60 or 0-120 points, depending on the placement test)
  6. Proficiency level (A1 lower beginner, A2 upper beginner, B1 lower intermediate, B2 upper intermediate, C1 lower advanced, C2 upper advanced)
  7. Proficiency level self-evaluation on each skill in English (speaking, listening, writing, reading)
  8. Proficiency level self-evaluation on each skill in additional foreign language (speaking, listening, writing, reading)
  9. English language certificates held, if any
  10. Sex
  11. Age
  12. Age of exposure to L2 English (AoE)
  13. Years studying English (Length of Instruction, LoI)
  14. Stays in English-speaking countries? (yes/no)
  15. Stay(s): Where?
  16. Stay(s): When? (period(s) of residence)
  17. Stay(s): How long? (length of residence)
  18. School/University/Educational institution (if any)
  19. Major degree (if any)
  20. Year at university/school (if any)

Task variables
  1. Task title
  2. Task text (written text/spoken text transcription/audio file)
  3. Approximate time to produce the task (in minutes)
  4. Where was the task done? (in class/outside class/both)
  5. Resources used to produce the task (help from English native/bilingual dictionary/monolingual dictionary/spellchecker/grammar book/background readings/none)

Table: Native’s variables (linguistic background and task)

Linguistic background variables
  1. L1 of the native speaker
  2. L1 variety
  3. L1 of the native speaker’s father
  4. L1 of the native speaker’s mother
  5. Language(s) spoken at home
  6. Proficiency level self-evaluation on each skill in foreign language (speaking, listening, writing, reading)
  7. Proficiency level self-evaluation on each skill in additional foreign language (speaking, listening, writing, reading)
  8. Sex
  9. Age
  10. School/University/Educational institution (if any)
  11. Major degree (if any)
  12. Year at university/school (if any)

Task variables
  1. Task title
  2. Task text (written text/spoken text transcription/audio file)
  3. Approximate time to produce the task (in minutes)
  4. Resources used to produce the task (monolingual dictionary/spellchecker/grammar book/background information about the topic of your text (TV, internet, magazines, books...))

Corpus design: Proficiency level

COREFL contains data from learners of English at all proficiency levels (beginner, intermediate, advanced). Unlike other learner corpora that do not contain a standardised measure of the learners’ proficiency, COREFL uses three proficiency-level measurements:

(1) Objective measurement: Learners were administered a standardised placement test, which objectively measures their proficiency level. Learners were classified into 6 levels according to CEFR (A1, A2, B1, B2, C1, C2):

CEFR level | Proficiency level | Oxford Quick Placement Test¹ scores (and corresponding %) | Cambridge Unlimited Placement Test² scores (and corresponding %)
A1 | Lower beginner | 0-17 (0%-28.3%) | 0-35 (0.0%-29.2%)
A2 | Upper beginner | 18-29 (30.0%-48.3%) | 36-55 (30.0%-45.8%)
B1 | Lower intermediate | 30-39 (50.0%-65.0%) | 56-75 (46.7%-62.5%)
B2 | Upper intermediate | 40-47 (66.7%-78.3%) | 76-95 (63.3%-79.2%)
C1 | Lower advanced | 48-54 (80.0%-90.0%) | 96-120 (80.0%-100%)
C2 | Upper advanced | 55-60 (91.7%-100%) |

¹ Oxford University Press. (2003). Quick Placement Test. Oxford University Press. (Paper and pen, version 1, freely available online). This test was administered to most L2 English learners. The test raw scores range from 0 to 60.
² Cambridge University Press. (2010). English Unlimited Placement Test. Cambridge University Press. (Paper and pen version, freely available online). This test was administered to L2 English learners with a low proficiency level: younger learners (secondary school) and university students majoring in a modern language other than English. The test raw scores range from 0 to 120.
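As an illustration of how the score bands above map raw placement test scores onto CEFR levels, the following Python sketch may be useful. The function and constant names are hypothetical (they are not part of COREFL); only the band boundaries are taken from the table.

```python
# Upper bound of each CEFR band (raw score), copied from the table above.
OXFORD_QPT_BANDS = [(17, "A1"), (29, "A2"), (39, "B1"), (47, "B2"), (54, "C1"), (60, "C2")]
# The table lists no separate C2 band for the Cambridge test, so none is invented here.
CAMBRIDGE_UNLIMITED_BANDS = [(35, "A1"), (55, "A2"), (75, "B1"), (95, "B2"), (120, "C1")]


def cefr_level(score, bands):
    """Return the CEFR level whose score band contains the raw score."""
    max_score = bands[-1][0]
    if not 0 <= score <= max_score:
        raise ValueError(f"score {score} is outside the range 0-{max_score}")
    for upper_bound, level in bands:
        if score <= upper_bound:
            return level


print(cefr_level(42, OXFORD_QPT_BANDS))           # B2
print(cefr_level(60, CAMBRIDGE_UNLIMITED_BANDS))  # B1
```

Note that the same raw score corresponds to different levels depending on the test, so the test used must be known before a score is interpreted.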

(2) Subjective measurement: Learners self-rated their proficiency in English for each of the four skills (speaking, listening, reading, writing) according to a six-point ordinal scale. The subjective measurement for each skill is then transformed into a 1-6 numeric scale and a new variable is created called ‘Proficiency self-assessment’, which is an average of the four observations. For example, suppose a learner self-rates their English as follows: speaking A1, listening B1, reading A2, writing A1. These ordinal values are transformed into their corresponding numeric values: 1, 3, 2, 1. The final average for the variable ‘proficiency self-assessment’ is 1.75 (out of a maximum of 6).

Self-rating ordinal scale Corresponding numeric value
Lower beginner (A1) 1
Upper beginner (A2) 2
Lower intermediate (B1) 3
Upper intermediate (B2) 4
Lower advanced (C1) 5
Upper advanced (C2) 6
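The calculation of the ‘Proficiency self-assessment’ variable can be sketched as follows in Python. The function name and the dictionary layout are assumptions made for illustration only; the numeric values and the worked example come from the description above.

```python
# Numeric value of each self-rating level, as given in the table above.
SELF_RATING_VALUES = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5, "C2": 6}


def proficiency_self_assessment(ratings):
    """Average the numeric values of the self-ratings for the four skills."""
    values = [SELF_RATING_VALUES[level] for level in ratings.values()]
    return sum(values) / len(values)


# The worked example from the text: speaking A1, listening B1, reading A2, writing A1.
example = {"speaking": "A1", "listening": "B1", "reading": "A2", "writing": "A1"}
print(proficiency_self_assessment(example))  # 1.75
```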

(3) Language certificate measurement: Additionally, learners report on any English language certificates they may hold (e.g., First Certificate B2). Finally, learners also report on any other additional foreign languages they know (other than English) and self-rate themselves on each of the skills according to the 6-point subjective scale above.

Data collection

Written data: Written data were collected via online forms, which means that participants could take part from anywhere in the world. To ensure that all participants understood the forms and the instructions correctly, forms were written in their native language. To see the different forms, please visit the data-collection website. The written data are marked in the corpus files as ‘WRITING/AUDIO DETAILS: written_online’.

Some written texts from lower-level learners (secondary schools and university learners majoring in a modern language other than English) were originally handwritten and then transcribed into digital format. These are marked in the corpus files as ‘WRITING/AUDIO DETAILS: written_offline_classroom’.

Spoken data: Most spoken data were collected in situ at the Universidad de Granada. There were four spoken data collection methods:

  1. Most spoken data were recorded in a quiet room with the help of special audio recording equipment (Audio Technica AT2020: Cardioid condenser microphone, 74 dB, 1 kHz at 1 Pa) to guarantee the best quality of sound possible in the audio files. This ensures that phoneticians can download the audio files to perform fine-grained acoustic analyses. These are marked in the corpus files as: ‘WRITING/AUDIO DETAILS: spoken_offline_lab’.
  2. Some spoken data were recorded by a researcher in a classroom with a laptop computer in a face-to-face format at the Universidad de Granada. These are marked in the corpus as: ‘WRITING/AUDIO DETAILS: spoken_offline_classroom’.
  3. Due to the Covid-19 pandemic, during 2020 and 2021 some spoken data were collected in an online face-to-face format via the Google Meet software. These are marked in the corpus files as ‘WRITING/AUDIO DETAILS: spoken_offline_googlemeet’.
  4. A few audio files were self-recorded by the participants on their laptops and then uploaded to the corpus database. These are marked in the corpus files as ‘WRITING/AUDIO DETAILS: spoken_online’.
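The ‘WRITING/AUDIO DETAILS’ markers listed above make it possible to filter texts by medium and collection method. The Python sketch below assumes that the marker appears on a line of its own in a corpus file, exactly in the form quoted above; this is an assumption about the file layout, and the function name is hypothetical.

```python
MARKER_PREFIX = "WRITING/AUDIO DETAILS:"

# The six marker values mentioned in this section.
KNOWN_VALUES = {
    "written_online",
    "written_offline_classroom",
    "spoken_offline_lab",
    "spoken_offline_classroom",
    "spoken_offline_googlemeet",
    "spoken_online",
}


def parse_details(line):
    """Return (medium, value) for a marker line, or None for any other line."""
    line = line.strip()
    if not line.startswith(MARKER_PREFIX):
        return None
    value = line[len(MARKER_PREFIX):].strip()
    if value not in KNOWN_VALUES:
        raise ValueError(f"unknown marker value: {value!r}")
    # The medium ("written" or "spoken") is the part before the first underscore.
    medium = value.split("_", 1)[0]
    return medium, value


print(parse_details("WRITING/AUDIO DETAILS: spoken_offline_lab"))
```

A phonetician interested only in the lab recordings could, for instance, keep the files whose marker value is ‘spoken_offline_lab’.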

For consistency, a protocol was followed by all data collectors during oral recordings. Finally, the audio files were transcribed orthographically and converted into text files which are searchable and downloadable via the corpus web interface (tab ‘Search/Download’). See the transcription conventions in the tab ‘User Guide’ > ‘Transcription conventions’. The actual audio files can be downloaded.

COREFL in context

COREFL was designed following the protocols and design principles of the CEDEL2 corpus. It is in line with other international projects where large learner corpora are being created, such as ICLE (International Corpus of Learner English) (see the full list, Learner Corpora Around the World, maintained by the Catholic University of Louvain). COREFL was inspired by WriCLE (Written Corpus of Learner English), which is a freely available L1 Spanish-L2 English university-level corpus that we developed in earlier projects.