COREFL: Corpus of English as a Foreign Language

v1.0
Oct. 2021

Metadata

You can download the texts with metadata (as in the example below) or without metadata (i.e., only the text produced by the participant). In each downloaded file, every metadata field appears on the left and its value on the right, as in the following example for a Spanish learner of English:

Subcorpus: Learners
Filename: ES_WR_C2_47_39_3_CGL
Year data collection: 2017
Placement test score (raw): 57 / 60
Placement test score (%): 95
Proficiency: Upper advanced
Sex: Male
Age: 47
School/University/Institution: Universidad de San Andrés
Major: linguistica
Year at university/school: 
L1: Spanish
Father's native language: Spanish
Mother's native language: Spanish
Languages spoken at home: Spanish
Age of exposure to English: 8
Years studying English: 39
Stay abroad in English speaking country (>= 1 month): Yes
Stay abroad (where): USA
Stay abroad (when): 2005-2010
Stay abroad (months): 60
Language certificates (type and level): Proficiency
Proficiency (self-assessment) speaking: Upper advanced (C2)
Proficiency (self-assessment) listening: Upper advanced (C2)
Proficiency (self-assessment) reading: Upper advanced (C2)
Proficiency (self-assessment) writing: Upper advanced (C2)
Proficiency (self-assessment): 6 / 6
Additional foreign language(s): French
Proficiency (self-assessment) in additional language speaking: Lower intermediate (B1)
Proficiency (self-assessment) in additional language listening: Lower intermediate (B1)
Proficiency (self-assessment) in additional language reading: Lower advanced (C1)
Proficiency (self-assessment) in additional language writing: Upper beginner (A2)
Medium: Written
Task number: 3
Task title: Film
Writing/audio details: written_online
Minutes taken to complete the task: 5
Where the task was done: Outside classroom
Resources used: 
Text: This is the story of a young woman who lives with her old father. One 
day her father gets lost in the woods and ends up in a castle where he is
kidnapped by a hideaous beast. This beast is in fact a prince who was
transformed into a monster by a spell cast by a princess. The story goes that
the spell will be broken if the beast learns to love another human and earns
her love in return. Belle, the young girl, is horrified by the beast at the
beginning but as the days pass she falls in love with him. Things get
complicated by another male character, Gaston, who is in love with Belle. When
the Beast and Belle are about to live their love, Gaston attempts to kill the
beast but he dies when he falls from a tower. Finally, Belle tells the Beast
that she is is in love with him and the spell is broken.
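
The layout above (one "field: value" pair per line, with the text last) can be read into a dictionary with a few lines of Python. This is only a minimal sketch: it assumes that the first colon on a line separates the field name from its value and that "Text" is always the final field, and the function name parse_corefl_file is hypothetical rather than part of any COREFL tool.

```python
def parse_corefl_file(raw: str) -> dict[str, str]:
    """Split a downloaded file into a {field: value} dictionary.

    Assumptions (not confirmed by the corpus documentation): one field per
    line, the first colon ends the field name, and "Text" is the last field.
    """
    record: dict[str, str] = {}
    lines = raw.splitlines()
    for i, line in enumerate(lines):
        key, sep, value = line.partition(":")
        if key.strip() == "Text":
            # Everything from this line onwards is the participant's text,
            # which may be wrapped over several lines.
            rest = [value] + lines[i + 1:]
            record["Text"] = " ".join(p.strip() for p in rest if p.strip())
            break
        if sep:
            # Empty values (e.g. "Resources used:") are kept as "".
            record[key.strip()] = value.strip()
    return record


sample = """Subcorpus: Learners
Filename: ES_WR_C2_47_39_3_CGL
Placement test score (raw): 57 / 60
Year at university/school:
Text: This is the story of a young woman who lives with her old father. One
day her father gets lost in the woods."""

record = parse_corefl_file(sample)
print(record["Filename"])  # ES_WR_C2_47_39_3_CGL
```

Empty fields are stored as empty strings rather than dropped, so that a missing value can be told apart from a missing field.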

The tables below present a full list of metadata with a brief explanatory description. First, we will present the learners’ metadata and next the natives’ metadata.

Table: Learners’ metadata

Metadata Description
FILENAME:

Each file in the corpus has a unique code. For learners, the filename format is:

L1_medium_proficiency_age_LoI_tasknumber_initials

  • L1: EN English, ES Spanish, DE German.
  • MEDIUM: WR (written), SP (spoken)
  • PROFICIENCY: Proficiency level based on the CEFR (A1, A2, B1, B2, C1, C2), which roughly corresponds to the six traditional proficiency levels (lower beginner, upper beginner, lower intermediate, upper intermediate, lower advanced, upper advanced).
  • AGE: in years.
  • LoI: Length of Instruction in English (i.e., years studying English), e.g.: 13 (thirteen years) or 8.5 (eight and a half years).
  • TASK NUMBER: number of the task (see TASK NUMBER below for details).
  • INITIALS: the participant’s initials (e.g., JFK).

For example, the file code ES_WR_B1_18_13_14_LJL represents a Spanish native, who produced a written task, with a B1 level (=lower intermediate), who is 18 years old, who has been learning English for 13 years, who did the Chaplin task (task #14) and whose initials are LJL.
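
As an illustration of the format above, a learner filename can be split into its seven fields. The sketch below assumes that fields are separated by underscores and that the initials never contain an underscore; the names LearnerFilename and parse_learner_filename are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LearnerFilename:
    l1: str            # EN, ES or DE
    medium: str        # WR (written) or SP (spoken)
    proficiency: str   # CEFR level, A1 to C2
    age: int           # in years
    loi: float         # Length of Instruction in years, e.g. 13 or 8.5
    task_number: int
    initials: str


def parse_learner_filename(name: str) -> LearnerFilename:
    """Split a learner file code into its seven underscore-separated fields."""
    parts = name.split("_")
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}: {name!r}")
    l1, medium, proficiency, age, loi, task, initials = parts
    return LearnerFilename(l1, medium, proficiency,
                           int(age), float(loi), int(task), initials)


info = parse_learner_filename("ES_WR_B1_18_13_14_LJL")
print(info.proficiency, info.age, info.task_number)  # B1 18 14
```

LoI is parsed as a float because half years (e.g. 8.5) are allowed, whereas age and task number are whole numbers.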

YEAR DATA COLLECTION: The year when the data were collected, e.g., 2019.
PLACEMENT TEST RAW SCORE: The raw score obtained in the English placement test. Two placement tests were used:
  1. Oxford University Press. (2003). Quick Placement Test. Oxford University Press. (Paper and pen, version 1, freely available online.) This test was administered to most L2 English learners. The test raw scores range from 0 to 60.
  2. Cambridge University Press. (2010). English Unlimited Placement Test. Cambridge University Press. (Paper and pen version, freely available online.) This test was administered to L2 English learners with a low proficiency level: younger learners (secondary school) and university students majoring in a modern language other than English. The test raw scores range from 0 to 120.
PLACEMENT TEST % SCORE: The placement test raw score transformed into percentage (0% minimum - 100% maximum).
PROFICIENCY: The classification of the placement test raw score into proficiency categories according to the following table:
CEFR level | Proficiency level  | Oxford Quick Placement Test scores (and corresponding %) | Cambridge Unlimited Placement Test scores (and corresponding %)
A1         | Lower beginner     | 0-17 (0.0%-28.3%)                                        | 0-35 (0.0%-29.2%)
A2         | Upper beginner     | 18-29 (30.0%-48.3%)                                      | 36-55 (30.0%-45.8%)
B1         | Lower intermediate | 30-39 (50.0%-65.0%)                                      | 56-75 (46.7%-62.5%)
B2         | Upper intermediate | 40-47 (66.7%-78.3%)                                      | 76-95 (63.3%-79.2%)
C1         | Lower advanced     | 48-54 (80.0%-90.0%)                                      | 96-120 (80.0%-100%)
C2         | Upper advanced     | 55-60 (91.7%-100%)                                       | -
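
The percentage conversion and the score bands above can be expressed as a small lookup. The following is a sketch only: it assumes integer raw scores, and the names BANDS and classify_score are hypothetical; only the band limits come from the table.

```python
# Upper raw-score limit of each CEFR band, taken from the table above.
BANDS = {
    "oxford": [(17, "A1"), (29, "A2"), (39, "B1"),
               (47, "B2"), (54, "C1"), (60, "C2")],
    "cambridge": [(35, "A1"), (55, "A2"), (75, "B1"),
                  (95, "B2"), (120, "C1")],
}


def classify_score(raw: int, test: str = "oxford") -> tuple[str, float]:
    """Return (CEFR level, percentage score) for a placement test raw score."""
    bands = BANDS[test]
    maximum = bands[-1][0]  # 60 for the Oxford test, 120 for the Cambridge test
    if not 0 <= raw <= maximum:
        raise ValueError(f"raw score must be between 0 and {maximum}")
    percent = round(100 * raw / maximum, 1)
    # The first band whose upper limit is not exceeded is the learner's level.
    level = next(name for upper, name in bands if raw <= upper)
    return level, percent


print(classify_score(57))               # ('C2', 95.0)
print(classify_score(96, "cambridge"))  # ('C1', 80.0)
```

The first call reproduces the sample file shown earlier, where a raw score of 57 / 60 corresponds to 95% and the Upper advanced (C2) level.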
SEX: The participant’s sex (Male, Female, Unknown).
AGE: The participant’s age (in years).
SCHOOL/UNIVERSITY/INSTITUTION: The participant’s school or university name, if any.
MAJOR: The participant’s major subject at university, if any.
YEAR AT UNIVERSITY/SCHOOL: The participant’s year or course at school or university, if any.
L1: The participant’s L1 (native language).
FATHER'S NATIVE LANGUAGE: The native language of the participant’s father.
MOTHER'S NATIVE LANGUAGE: The native language of the participant’s mother.
LANGUAGE(S) SPOKEN AT HOME: The language(s) spoken at home.
AGE OF EXPOSURE TO ENGLISH: Age of Exposure (AoE), i.e., age at which the participant started learning English.
YEARS STUDYING ENGLISH: Length of Instruction (LoI) in English.
STAY IN ENGLISH-SPEAKING COUNTRY (≥1 MONTH): Stay(s) in any English-speaking country longer than one month.
STAY ABROAD (WHERE): English-speaking country of the stay.
STAY ABROAD (WHEN): Year(s) of the stay; or period(s) of the stay; or age of the participant when s/he did the stay.
STAY ABROAD (MONTHS): Length of the stay (in months), e.g., 3.5 months or 24 months or unknown (for cases where the stay is longer than one month but the participant did not specify the exact length in months).
LANGUAGE CERTIFICATES (TYPE AND LEVEL): Official language certificates held by the participant, if any.
PROFICIENCY (SELF-ASSESSMENT) SPEAKING:

The participant self-assesses his/her speaking level in English according to a 6-point scale:

  • Lower beginner (A1)
  • Upper beginner (A2)
  • Lower intermediate (B1)
  • Upper intermediate (B2)
  • Lower advanced (C1)
  • Upper advanced (C2)
PROFICIENCY (SELF-ASSESSMENT) LISTENING:

The participant self-assesses his/her listening level in English according to a 6-point scale:

  • Lower beginner (A1)
  • Upper beginner (A2)
  • Lower intermediate (B1)
  • Upper intermediate (B2)
  • Lower advanced (C1)
  • Upper advanced (C2)
PROFICIENCY (SELF-ASSESSMENT) READING:

The participant self-assesses his/her reading level in English according to a 6-point scale:

  • Lower beginner (A1)
  • Upper beginner (A2)
  • Lower intermediate (B1)
  • Upper intermediate (B2)
  • Lower advanced (C1)
  • Upper advanced (C2)
PROFICIENCY (SELF-ASSESSMENT) WRITING:

The participant self-assesses his/her writing level in English according to a 6-point scale:

  • Lower beginner (A1)
  • Upper beginner (A2)
  • Lower intermediate (B1)
  • Upper intermediate (B2)
  • Lower advanced (C1)
  • Upper advanced (C2)
PROFICIENCY (SELF-ASSESSMENT):

The participant’s average self-assessment score in the four skills together (speaking, writing, listening, reading) in English, according to a 6-point scale:

  • Lower beginner (A1)
  • Upper beginner (A2)
  • Lower intermediate (B1)
  • Upper intermediate (B2)
  • Lower advanced (C1)
  • Upper advanced (C2)

It is calculated as follows: Each skill is self-scored, as described above, and then an average is obtained from the four scores.
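
The averaging step can be sketched as follows. The mapping of the six levels to the points 1 to 6 is an assumption (it is consistent with the sample file, where four C2 ratings give 6 / 6), the function name self_assessment_average is hypothetical, and any rounding applied in the corpus itself is not documented here.

```python
import re

# Assumed numeric value of each point on the 6-point scale.
CEFR_POINTS = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5, "C2": 6}


def self_assessment_average(*labels: str) -> float:
    """Average the self-assessed skills, given labels such as 'Upper advanced (C2)'."""
    points = []
    for label in labels:
        match = re.search(r"\((A1|A2|B1|B2|C1|C2)\)", label)
        if match is None:
            raise ValueError(f"no CEFR level found in {label!r}")
        points.append(CEFR_POINTS[match.group(1)])
    return sum(points) / len(points)


# Speaking, listening, reading and writing from the sample file.
english = self_assessment_average(
    "Upper advanced (C2)", "Upper advanced (C2)",
    "Upper advanced (C2)", "Upper advanced (C2)",
)
print(english)  # 6.0
```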

ADDITIONAL FOREIGN LANGUAGE(S): Additional foreign languages (other than English) known by the participant, if any.
PROFICIENCY (SELF-ASSESSMENT) IN ADDITIONAL LANGUAGE SPEAKING:

The participant self-assesses his/her speaking level in the additional foreign language according to a 6-point scale:

  • Lower beginner (A1)
  • Upper beginner (A2)
  • Lower intermediate (B1)
  • Upper intermediate (B2)
  • Lower advanced (C1)
  • Upper advanced (C2)
PROFICIENCY (SELF-ASSESSMENT) IN ADDITIONAL LANGUAGE LISTENING:

The participant self-assesses his/her listening level in the additional foreign language according to a 6-point scale:

  • Lower beginner (A1)
  • Upper beginner (A2)
  • Lower intermediate (B1)
  • Upper intermediate (B2)
  • Lower advanced (C1)
  • Upper advanced (C2)
PROFICIENCY (SELF-ASSESSMENT) IN ADDITIONAL LANGUAGE READING:

The participant self-assesses his/her reading level in the additional foreign language according to a 6-point scale:

  • Lower beginner (A1)
  • Upper beginner (A2)
  • Lower intermediate (B1)
  • Upper intermediate (B2)
  • Lower advanced (C1)
  • Upper advanced (C2)
PROFICIENCY (SELF-ASSESSMENT) IN ADDITIONAL LANGUAGE WRITING:

The participant self-assesses his/her writing level in the additional foreign language according to a 6-point scale:

  • Lower beginner (A1)
  • Upper beginner (A2)
  • Lower intermediate (B1)
  • Upper intermediate (B2)
  • Lower advanced (C1)
  • Upper advanced (C2)
MEDIUM (WRITTEN/SPOKEN):

The medium in which the task was produced:

  • Written
  • Spoken
TASK NUMBER: This is the number of the task (1 to 14).
  • 2. Famous Person: Talk about a famous person.
  • 3. Film: Summarise a film you have seen recently.
  • 13. Frog: Tell the story shown in the pictures. You can add new aspects to the story or ignore some aspects in the pictures. Your text should start “One day...”. https://goo.gl/so3S6W
  • 14. Chaplin: Watch the following Chaplin video clip (4 minutes). Summarise the story. You can watch the video clip more than once.
    https://www.youtube.com/watch?v=4QkTNJFhu-g
TASK TITLE: This is the title of the task (Famous person; Film; Frog; Chaplin).
WRITING/AUDIO DETAILS:

These are additional details about where the task was collected:

  • written_online (a written task that was collected via online forms on the internet).
  • written_offline_classroom (a handwritten task that was collected on pen and paper in the classroom and later transcribed).
  • spoken_online (a spoken task that was self-recorded by the learner on his/her computer while at home).
  • spoken_offline_classroom (a spoken task that was recorded by an assistant in the classroom).
  • spoken_offline_lab (a spoken task that was recorded by an assistant in a quiet lab with specialised recording equipment: Audio Technica AT2020 cardioid condenser microphone, 74 dB, 1 kHz at 1 Pa). These audio files are of the highest quality in the corpus and are ideal for phoneticians and phonologists.
  • spoken_offline_googlemeet (a spoken task that was collected in an online face-to-face format via Google Meet due to the COVID-19 pandemic in 2020 and 2021).
MINUTES TAKEN TO COMPLETE THE TASK: The time taken to complete the task as self-reported by the participant (sometimes there is no self-reported information in this metadata).
WHERE WAS THE TASK DONE:

The location where the task was done:

  • Inside the classroom.
  • Outside the classroom.
  • Both inside and outside the classroom (i.e., the task was done in the classroom, for example, and then finished off at home).
RESOURCES USED:

The resources the participant used to complete the task, if any:

  • Help from an English native speaker
  • Bilingual dictionary (English/Learners’ L1)
  • Monolingual dictionary (English)
  • Spellchecker
  • Grammar book
  • Background readings about the task topic (newspapers, internet, TV, etc.)
TEXT: This is the text produced in the task (either the written text or the transcription of the spoken text).

Table: Natives’ metadata

Metadata Description
FILENAME:

Each file in the corpus has a unique filename. The filename format for the natives is:

L1_medium_age_tasknumber_initials

For example, the file code EN_WR_18_13_BB represents an English native who produced a written task, who is 18 years old, who did task #13 (Frog story) and whose initials are BB.

Similarly, the file code ES_WR_25_13_FBL represents a Spanish native who produced a written task, who is 25 years old, who did task number 13 (Frog) and whose initials are FBL.
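
Native filenames have five fields rather than seven, since they carry no proficiency level or Length of Instruction. A minimal parsing sketch, assuming underscore-separated fields and initials without underscores (the function name parse_native_filename is hypothetical):

```python
def parse_native_filename(name: str) -> dict:
    """Split a native file code into its five underscore-separated fields."""
    parts = name.split("_")
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {name!r}")
    l1, medium, age, task, initials = parts
    return {
        "l1": l1,                  # EN, ES or DE
        "medium": medium,          # WR (written) or SP (spoken)
        "age": int(age),           # in years
        "task_number": int(task),
        "initials": initials,
    }


native = parse_native_filename("EN_WR_18_13_BB")
print(native["age"], native["task_number"])  # 18 13
```

Under these assumptions, the number of fields (five or seven) is also enough to tell a native file code from a learner file code.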

YEAR DATA COLLECTION: The year when the data were collected.
INITIALS: The participant’s initials (e.g., JFK).
SEX: The participant’s sex (Male, Female, Unknown).
AGE: The participant’s age (in years).
SCHOOL/UNIVERSITY/INSTITUTION: The participant’s school or university name, if any.
MAJOR: The participant’s major subject at university, if any.
YEAR AT UNIVERSITY/SCHOOL: The participant’s year or course at school or university, if any.
L1: The participant’s L1 (native language).
VARIETY OF NATIVE LANGUAGE (COUNTRY): The variety of the participant’s L1 (e.g., American, British, Australian, Canadian, etc).
FATHER'S NATIVE LANGUAGE: The native language of the participant’s father.
MOTHER'S NATIVE LANGUAGE: The native language of the participant’s mother.
LANGUAGE(S) SPOKEN AT HOME: The language(s) spoken at home.
ANY FOREIGN LANGUAGE?: Whether the participant knows a foreign language (yes, no).
FOREIGN LANGUAGE: Foreign language known by the participant, if any.
PROFICIENCY (SELF-ASSESSMENT) FOREIGN LANGUAGE SPEAKING:

The participant self-assesses his/her speaking level in the foreign language according to a 6-point scale:

  • Lower beginner (A1)
  • Upper beginner (A2)
  • Lower intermediate (B1)
  • Upper intermediate (B2)
  • Lower advanced (C1)
  • Upper advanced (C2)
PROFICIENCY (SELF-ASSESSMENT) FOREIGN LANGUAGE LISTENING:

The participant self-assesses his/her listening level in the foreign language according to a 6-point scale:

  • Lower beginner (A1)
  • Upper beginner (A2)
  • Lower intermediate (B1)
  • Upper intermediate (B2)
  • Lower advanced (C1)
  • Upper advanced (C2)
PROFICIENCY (SELF-ASSESSMENT) FOREIGN LANGUAGE READING:

The participant self-assesses his/her reading level in the foreign language according to a 6-point scale:

  • Lower beginner (A1)
  • Upper beginner (A2)
  • Lower intermediate (B1)
  • Upper intermediate (B2)
  • Lower advanced (C1)
  • Upper advanced (C2)
PROFICIENCY (SELF-ASSESSMENT) FOREIGN LANGUAGE WRITING:

The participant self-assesses his/her writing level in the foreign language according to a 6-point scale:

  • Lower beginner (A1)
  • Upper beginner (A2)
  • Lower intermediate (B1)
  • Upper intermediate (B2)
  • Lower advanced (C1)
  • Upper advanced (C2)
ADDITIONAL FOREIGN LANGUAGE(S): Additional foreign language(s) known by the participant (other than the one listed under FOREIGN LANGUAGE), if any.
PROFICIENCY (SELF-ASSESSMENT) IN ADDITIONAL FOREIGN LANGUAGE SPEAKING:

The participant self-assesses his/her speaking level in the additional foreign language according to a 6-point scale:

  • Lower beginner (A1)
  • Upper beginner (A2)
  • Lower intermediate (B1)
  • Upper intermediate (B2)
  • Lower advanced (C1)
  • Upper advanced (C2)
PROFICIENCY (SELF-ASSESSMENT) IN ADDITIONAL FOREIGN LANGUAGE LISTENING:

The participant self-assesses his/her listening level in the additional foreign language according to a 6-point scale:

  • Lower beginner (A1)
  • Upper beginner (A2)
  • Lower intermediate (B1)
  • Upper intermediate (B2)
  • Lower advanced (C1)
  • Upper advanced (C2)
PROFICIENCY (SELF-ASSESSMENT) IN ADDITIONAL FOREIGN LANGUAGE READING:

The participant self-assesses his/her reading level in the additional foreign language according to a 6-point scale:

  • Lower beginner (A1)
  • Upper beginner (A2)
  • Lower intermediate (B1)
  • Upper intermediate (B2)
  • Lower advanced (C1)
  • Upper advanced (C2)
PROFICIENCY (SELF-ASSESSMENT) IN ADDITIONAL FOREIGN LANGUAGE WRITING:

The participant self-assesses his/her writing level in the additional foreign language according to a 6-point scale:

  • Lower beginner (A1)
  • Upper beginner (A2)
  • Lower intermediate (B1)
  • Upper intermediate (B2)
  • Lower advanced (C1)
  • Upper advanced (C2)
MEDIUM (WRITTEN/SPOKEN):

The medium in which the task was produced:

  • Written
  • Spoken
TASK NUMBER: This is the number of the task (1 to 14).
  • 2. Famous Person: Talk about a famous person.
  • 3. Film: Summarise a film you have seen recently.
  • 13. Frog: Tell the story shown in the pictures. You can add new aspects to the story or ignore some aspects in the pictures. Your text should start “One day...”. https://goo.gl/so3S6W
  • 14. Chaplin: Watch the following Chaplin video clip (4 minutes). Summarise the story. You can watch the video clip more than once.
    https://www.youtube.com/watch?v=4QkTNJFhu-g
TASK TITLE: This is the title of the task (Famous person; Film; Frog; Chaplin).
WRITING/AUDIO DETAILS:

These are additional details about where the task was collected:

  • written_online (a written task that was collected via online forms on the internet).
  • written_offline_classroom (a handwritten task that was collected on pen and paper in the classroom and later transcribed).
  • spoken_online (a spoken task that was self-recorded by the participant on his/her computer while at home).
  • spoken_offline_classroom (a spoken task that was recorded by an assistant in the classroom).
  • spoken_offline_lab (a spoken task that was recorded by an assistant in a quiet lab with specialised recording equipment: Audio Technica AT2020 cardioid condenser microphone, 74 dB, 1 kHz at 1 Pa). These audio files are of the highest quality in the corpus and are ideal for phoneticians and phonologists.
  • spoken_offline_googlemeet (a spoken task that was collected in an online face-to-face format via Google Meet due to the COVID-19 pandemic in 2020 and 2021).
MINUTES TAKEN TO COMPLETE THE TASK: The time taken to complete the task as self-reported by the participant (sometimes there is no self-reported information in this metadata).
WHERE WAS THE TASK DONE:

The location where the task was done:

  • Inside the classroom.
  • Outside the classroom.
  • Both inside and outside the classroom (i.e., the task was done in the classroom, for example, and then finished off at home).
RESOURCES USED:

The resources the participant used to complete the task, if any:

  • Monolingual dictionary (in the learners’ L1)
  • Spellchecker
  • Grammar book
  • Background readings about the task topic (newspapers, internet, TV, etc.)
TEXT: This is the text produced in the task (either the written text or the transcription of the spoken text).