5 Differences between the original corpus and the tagged versions

Apart from the addition of a tag to each word and consequent changes in the format, the tagged versions differ from the original corpus in several respects. A small number of errors in the text were discovered and corrected during the tagging process. More important, a number of changes were made during the pre-editing which prepared the texts for the automatic programs (cf 4.1). Some of the changes are marked by an @ flag in the vertical version of the tagged corpus (see 2.6).

5.1 Capitalisation

Initial capitals in the original corpus were changed to lower case at the beginning of sentences, in headings, book titles, quotations, and descriptive naming expressions. The reasoning is that in these cases the capital is not a permanent characteristic of the word and does not signal its grammatical behaviour. The only words which keep their word-initial capital are:

Examples of changes in capitalisation are:

the             BECOMES   the
British                   British
Broadcasting              broadcasting
Corporation               corporation

Look                      look
Back                      back
in                        in
Anger                     anger

In the first example British is a JNP (adjective habitually written with a word-initial capital) and therefore retains its initial capital.
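The rule illustrated by these examples can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pre-editing procedure actually used: the function name, the set KEEP_CAPITAL_TAGS and the placeholder tag OTHER are hypothetical, and JNP is the only tag taken from the text.

```python
# Hypothetical sketch of the capitalisation rule. Only JNP comes from the
# manual; "OTHER" is a placeholder tag, not an entry in the real tagset.
KEEP_CAPITAL_TAGS = {"JNP"}


def normalise_capitals(tagged_words, keep_tags=KEEP_CAPITAL_TAGS):
    """Lower-case a word-initial capital unless the word's tag is in keep_tags."""
    result = []
    for word, tag in tagged_words:
        if tag not in keep_tags and word[:1].isupper():
            word = word[0].lower() + word[1:]
        result.append(word)
    return result


example = [
    ("the", "OTHER"),
    ("British", "JNP"),
    ("Broadcasting", "OTHER"),
    ("Corporation", "OTHER"),
]
print(" ".join(normalise_capitals(example)))
# prints: the British broadcasting corporation
```

Passing the set of capital-retaining tags as a parameter keeps the sketch usable with whatever set of tags the user decides should preserve the capital.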

Since there is no clear borderline between 'true' proper names and descriptive naming expressions and since the capitalisation of a word (e.g. Catholic) is not always consistent, it was not possible to arrive at complete consistency in the use of upper- vs lower-case initials in the tagged corpus. Some guidance to changes in capitalisation is given in the 'special information' columns of the vertical version of the tagged corpus (see 2.2, 2.6). For exact information on capitalisation, the user must consult the original corpus text.

5.2 Punctuation marks and sentence/paragraph division

Punctuation marks are treated as separate 'words' and are given their own tags (see 7.24). Full stops are, however, only treated as 'words' when they mark the end of sentences, not when used as decimal points or abbreviation points (as in e.g.). Note the following changes in punctuation with respect to the original corpus:

Note further the following changes in the marking of sentences and paragraphs (cf Johansson et al 1978:26f):

As a result of these changes, we find the following marking of sentence/paragraph division in the tagged versions of the corpus (the marking in the original corpus is given within parentheses):

 

                              Vertical version                    Horizontal version    KWIC concordance
Beginning of sentences (^)    ----                                ^                     ^
Included sentences (~)        -----, I ('special information')    ~                     ~
Beginning of heading (*<)     ----, H ('special information')     ^, separate line      ^
Beginning of list (_)         -----                               ^                     ^
Beginning of paragraph (|)    P ('special information')           indentation           3 spaces
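
For users who process more than one version of the corpus, the overview above can be held in a small lookup table. The following Python sketch merely transcribes that overview; the dictionary name, the key names and the helper function are illustrative assumptions, and the number of hyphens is copied as given.

```python
# Sentence/paragraph division markers, keyed by the marker used in the
# original corpus. "special" is the letter entered under 'special
# information' in the vertical version (None where the overview lists none).
DIVISION_MARKERS = {
    "^": {"unit": "beginning of sentence", "vertical": "----", "special": None,
          "horizontal": "^", "kwic": "^"},
    "~": {"unit": "included sentence", "vertical": "-----", "special": "I",
          "horizontal": "~", "kwic": "~"},
    "*<": {"unit": "beginning of heading", "vertical": "----", "special": "H",
           "horizontal": "^ (separate line)", "kwic": "^"},
    "_": {"unit": "beginning of list", "vertical": "-----", "special": None,
          "horizontal": "^", "kwic": "^"},
    "|": {"unit": "beginning of paragraph", "vertical": None, "special": "P",
          "horizontal": "indentation", "kwic": "3 spaces"},
}


def marker_in(version, original_marker):
    """Return how an original-corpus marker is rendered in the given version."""
    return DIVISION_MARKERS[original_marker][version]


print(marker_in("horizontal", "|"))  # prints: indentation
```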

5.3 Contractions

Contracted words are split up in the tagged corpus (indicated under 'special information' in the vertical version of the corpus; cf 2.6). See further the end of 7.24 (under 'apostrophe').

5.4 Codes for abbreviations and 'non-English' words

In the original corpus an abbreviation is marked in one of two ways: it is either preceded by \0, as in \0Mr, or enclosed in curly brackets, as in {0U.S.A.}. In the tagged corpus all abbreviations are prefixed by \0. The similar 'non-English' codes (cf Johansson et al 1978:28-30) are stripped away. Markers are inserted under 'special information' in the vertical version; see 2.6. In the horizontal version and the concordance there are no 'non-English' codes.
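
The unification of the two abbreviation markings can be sketched as a single substitution. This is a minimal Python illustration, not the program used in preparing the tagged corpus. It assumes that the abbreviation code is a backslash followed by the digit zero (the \0 prefix named above), that a bracketed abbreviation contains no nested brackets, and that one \0 prefix replaces the brackets; the function name is hypothetical.

```python
import re

# Assumption: a bracketed abbreviation has the form {0...} without nesting.
BRACED_ABBREVIATION = re.compile(r"\{0([^{}]*)\}")


def prefix_abbreviations(text):
    # Rewrite the bracketed form as the backslash-zero prefix form.
    # Text that already uses the prefix form is left unchanged.
    # A function is used as the replacement so that the backslash is
    # inserted literally rather than read as a regex escape.
    return BRACED_ABBREVIATION.sub(lambda m: "\\0" + m.group(1), text)


print(prefix_abbreviations(r"\0Mr Smith left the {0U.S.A.} in May"))
# prints: \0Mr Smith left the \0U.S.A. in May
```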

For information on the tagging of abbreviations and 'non-English' words, see 7.19-21.

5.5 Other differences

The following features have been deleted in the tagged versions of the corpus: