What is tagging?
Part-of-Speech tagging (POS tagging) consists of automatically assigning a tag (a label) to each word according to its linguistic category. A simplified form of POS tagging is similar to what we used to do at school when identifying words as Nouns, Verbs, Adjectives or Prepositions. For example, the word ‘friend’ will be tagged as NCS, which means that it is a Noun, Common, Singular.
Which COREFL subcorpora are tagged?
In this version of COREFL (version 1), the English and Spanish components have been POS tagged: the L2 English learner subcorpora, the L1 Spanish native subcorpus, and the L1 English native subcorpus.
What is POS tagging used for in COREFL?
When searching the COREFL corpus, you can do two types of searches:
- Searching for a word: you can do a simple search for individual words like ‘are’ or ‘table’, or for a combination of words like ‘there is’, ‘I like’ or ‘in spite of’. This is called a ‘string’ search.
- Searching for a word category: you can do an advanced search by looking for a Verb, a Noun, or a combination like Adjective+Noun (an adjective followed by a noun) or Noun+Adjective (a word order typically produced by many learners of English). This gives you a more sophisticated way of searching for constituents in the corpus. Please see ‘Web Interface: User manual’ for further details on more sophisticated searches.
An advanced search is only possible if the corpus has been POS tagged beforehand. This is why COREFL has been POS tagged.
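The difference between the two types of search can be sketched in a few lines of Python. This is only an illustration, not the COREFL search engine: the tags (DET, ADJ, NOUN, VERB) are simplified, hypothetical labels rather than real FreeLing tags, and the function names `string_search` and `tag_search` are invented for this example.

```python
# A tiny tagged text: each token is a (word, tag) pair.
# The tags are simplified, hypothetical labels, NOT the real FreeLing tagset.
tagged = [
    ("the", "DET"), ("white", "ADJ"), ("dog", "NOUN"), ("saw", "VERB"),
    ("a", "DET"), ("dog", "NOUN"), ("white", "ADJ"),
]


def string_search(tokens, words):
    """Return the start positions where the given sequence of words occurs."""
    forms = [word.lower() for word, _ in tokens]
    target = [word.lower() for word in words]
    n = len(target)
    return [i for i in range(len(forms) - n + 1) if forms[i:i + n] == target]


def tag_search(tokens, tags):
    """Return the start positions where the given sequence of tags occurs."""
    seq = [tag for _, tag in tokens]
    target = list(tags)
    n = len(target)
    return [i for i in range(len(seq) - n + 1) if seq[i:i + n] == target]


# String search: needs only the words themselves.
print(string_search(tagged, ["white", "dog"]))
# Category search: needs the tags, so the text must be POS tagged first.
print(tag_search(tagged, ["ADJ", "NOUN"]))
print(tag_search(tagged, ["NOUN", "ADJ"]))
```

The string search works on untagged text, whereas the category search has nothing to match against unless every word already carries a tag.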
Which tags have been used?
COREFL subcorpora have been automatically POS tagged with the FreeLing tagger. For an interpretation of the tags, see the FreeLing tagset description and, more specifically, the English tagset and the Spanish tagset. You can also try an online demo of FreeLing, where you can enter your own text and have it tagged automatically.
An important note on automatic POS tagging
Please note that in this version of COREFL the POS tagging has been done automatically, which means that some words produced by learners may have been incorrectly categorised due to the very nature of learner language. This happens because the POS tagger automatically applies native English categories to the learner language (L2 English), e.g.:
- “In the night, the boy go to bed and the frog scape the room.”: the word ‘go’ is tagged as an infinitive verb when the learner should have used the irregular past tense verb ‘went’; ‘scape’ is automatically tagged as a noun but we know the learner actually intended to use the regular past tense ‘escaped’.
- “When he finds his love, he discovers that she has alzehimer to”: the word ‘to’ is automatically tagged as a preposition, though we know that the learner intended to use the adverb ‘too’.
- “The boy looking foor the frog”: the non-existent word ‘foor’ is automatically tagged as a noun, though we know the learner intended to use the preposition ‘for’.
- “he falled into the river”: the form ‘falled’ has been correctly tagged by the automatic tagger as a past tense verb. However, the tag misses an important aspect which is typical of the learners’ interlanguage: ‘falled’ is an overregularization, in which the regular past tense ‘-ed’ morpheme is applied to an irregular verb. So, looking for overregularizations is not possible when automatic tagging is used.
Despite the shortcomings of automatic tagging, we believe that this type of tagging is still very useful for those users who want to do complex searches, e.g., two advanced searches comparing the word order Noun+Adjective (dog white) vs. Adjective+Noun (white dog). The automatic tagger may not always tag learners’ misspelled adjectives as adjectives. However, properly spelled adjectives, which are the majority, will be correctly tagged as adjectives. Therefore, automatic tagging in a learner corpus is more useful than no tagging at all.
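The word-order comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions: the tags are simplified, hypothetical labels (not real FreeLing tags), the three sentences are invented, and we simply assume that the misspelled adjective ‘whitte’ has been tagged as a noun, in the same way as the mis-tagged words in the examples above.

```python
from collections import Counter

# Invented, hand-written tagger output with simplified, hypothetical tags.
sentences = [
    [("the", "DET"), ("white", "ADJ"), ("dog", "NOUN")],
    [("a", "DET"), ("dog", "NOUN"), ("white", "ADJ")],
    # Assumption: the misspelled adjective 'whitte' is tagged as a noun.
    [("a", "DET"), ("dog", "NOUN"), ("whitte", "NOUN")],
]


def count_patterns(sents, patterns):
    """Count how often each two-tag pattern occurs on adjacent words."""
    counts = Counter()
    for sent in sents:
        tags = [tag for _, tag in sent]
        # Pair each tag with the tag of the following word.
        for pair in zip(tags, tags[1:]):
            if pair in patterns:
                counts[pair] += 1
    return counts


counts = count_patterns(sentences, {("ADJ", "NOUN"), ("NOUN", "ADJ")})
print("Adjective+Noun:", counts[("ADJ", "NOUN")])
print("Noun+Adjective:", counts[("NOUN", "ADJ")])
```

In this sketch the third sentence is missed by the Noun+Adjective search because of the mis-tagged misspelling, while the correctly spelled cases are still found.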