5 Differences between the original corpus and the tagged versions

Apart from the addition of a tag to each word and consequent changes in the format, the tagged versions differ from the original corpus in several respects. A small number of errors in the text were discovered and corrected during the tagging process. More important, a number of changes were made during the pre-editing which prepared the texts for the automatic programs (cf 4.1). Some of the changes are marked by an @ flag in the vertical version of the tagged corpus (see 2.6).

5.1 Capitalisation

Initial capitals in the original corpus were changed to lower case at the beginning of sentences, in headings, book titles, quotations, and descriptive naming expressions. The reasoning is that in these cases the capital is not a permanent characteristic of the word and does not signal its grammatical behaviour. The only words which keep their word-initial capital are:

The first-person pronoun I.
The words for the months and the days of the week (tagged NR).
Words to be tagged NP, NPL, NPT, NNP, JNP, i.e. 'true' proper names and other words habitually written with a word-initial capital. See 7.7.

Examples of changes in capitalisation are:

the	BECOMES	the
British		British
Broadcasting		broadcasting
Corporation		corporation

Look	BECOMES	look
Back		back
in		in
Anger		anger

In the first example British is a JNP (adjective habitually written with a word-initial capital) and therefore retains its initial capital.

Since there is no clear borderline between 'true' proper names and descriptive naming expressions and since the capitalisation of a word (e.g. Catholic) is not always consistent, it was not possible to arrive at complete consistency in the use of upper- vs lower-case initials in the tagged corpus. Some guidance to changes in capitalisation is given in the 'special information' columns of the vertical version of the tagged corpus (see 2.2, 2.6). For exact information on capitalisation, the user must consult the original corpus text.
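The capitalisation rule above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not the procedure actually used in pre-editing, which involved manual judgement. The function name normalise_capital is hypothetical, and the tags ATI and NN in the usage lines are there purely for illustration.

```python
# Tags whose words keep a word-initial capital, as listed in 5.1.
KEEP_CAPITAL_TAGS = {"NR", "NP", "NPL", "NPT", "NNP", "JNP"}


def normalise_capital(word: str, tag: str) -> str:
    """Lower-case a word-initial capital unless the word is exempt.

    Hypothetical helper: it applies the rule mechanically and cannot
    resolve the borderline cases mentioned in 5.1.
    """
    if word == "I" or tag in KEEP_CAPITAL_TAGS:
        return word
    return word[:1].lower() + word[1:]


# Illustrative input: the first example from 5.1 with assumed tags.
title = [("the", "ATI"), ("British", "JNP"),
         ("Broadcasting", "NN"), ("Corporation", "NN")]
print([normalise_capital(w, t) for w, t in title])
```

Because the decision depends on the tag and not on the spelling, a word such as Catholic will come out differently depending on how it was tagged, which mirrors the inconsistency noted above.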

5.2 Punctuation marks and sentence/paragraph division

Punctuation marks are treated as separate 'words' and are given their own tags (see 7.24). Full stops are, however, only treated as 'words' when they mark the end of sentences, not when used as decimal points or abbreviation points (as in e.g.). Note the following changes in punctuation with respect to the original corpus:

Abbreviation points are deleted at the end of words. Thus, for example, Mr. and Mr are no longer distinguished.

The comment tags **[BEGIN QUOTE**], **[END QUOTE**], and **[MIDDLE OF QUOTE**] are converted to appropriate quote marks, *" or **".

Markers of the opening (*<) and close (*>) of headings are converted to sentence-initial markers (see below) and full stops (.), respectively. In the vertical version of the tagged corpus a marker H is inserted under 'special information' (see 2.6). There is no special marking of headings in the horizontal version, except that they are placed on a separate line.
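As a rough illustration of the marker conversions just listed, the following Python sketch works on a text that has already been split into tokens. Several points are assumptions: that each marker arrives as a single token, that the begin-quote tag maps to *" and the end-quote tag to **", and that ^ is the sentence-initial marker. The treatment of the middle-of-quote tag is not specified here, so the sketch leaves it untouched. The function name convert_markers is hypothetical.

```python
HEADING_OPEN, HEADING_CLOSE = "*<", "*>"
SENTENCE_INITIAL = "^"  # assumed form of the sentence-initial marker

# Assumed mapping; the text only says 'appropriate quote marks'.
QUOTE_TAGS = {
    "**[BEGIN QUOTE**]": '*"',
    "**[END QUOTE**]": '**"',
}


def convert_markers(tokens):
    """Return a new token list with heading and quote markers converted."""
    out = []
    for tok in tokens:
        if tok == HEADING_OPEN:
            out.append(SENTENCE_INITIAL)   # heading opens like a sentence
        elif tok == HEADING_CLOSE:
            out.append(".")                # heading closes with a full stop
        else:
            out.append(QUOTE_TAGS.get(tok, tok))
    return out


print(convert_markers(["*<", "look", "back", "in", "anger", "*>"]))
```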

Note further the following changes in the marking of sentences and paragraphs (cf Johansson et al 1978:26f):

The begin-list marker, which was used in the original corpus to indicate the beginning of word sequences without syntactic structure (e.g. in recipes), has been converted to a sentence-initial marker.
In the vertical version of the corpus the marker for included sentences is converted to an ordinary sentence-initial marker, e.g. before the quotation in: She said, 'Let's go,' and left immediately. A marker I is included under 'special information' (see 2.6).
In contrast to the original corpus, 'paragraph indicators' in the text such as 1. a) B. are treated as separate sentences. Thus, for example, additional sentence-initial markers are inserted in:
They sought to answer three questions: 1. Are there differences in adjustment to ageing and retirement according to the occupational level of employees? 2. If so, which occupational levels are ...
The paragraph marker in the original corpus has been deleted. In the vertical version of the tagged corpus a marker P is inserted under 'special information' (see 2.6). In the horizontal version paragraph division is marked by indentation.

As a result of these changes, we find the following marking of sentence/paragraph division in the tagged versions of the corpus (the marking in the original corpus is given within parentheses):

Feature (original marking)	Vertical version	Horizontal version	KWIC concordance
Beginning of sentences (^)	-----	^	^
Included sentences (~)	----- I ('special information')	~	~
Beginning of heading (*<)	----- H ('special information')	^ separate line	^
Beginning of list (_)	-----	^	^
Beginning of paragraph (|)	P ('special information')	indentation	3 spaces
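For users who process more than one version of the corpus, the table can be held as a small lookup structure. The Python sketch below is only one possible encoding. The names Marking and MARKING are hypothetical, and the vertical version is reduced to the letter given under 'special information', leaving out the line of dashes.

```python
from typing import NamedTuple, Optional


class Marking(NamedTuple):
    vertical_flag: Optional[str]  # letter under 'special information', if any
    horizontal: str               # marking in the horizontal version
    kwic: str                     # marking in the KWIC concordance


# Keys are the markers used in the original corpus.
MARKING = {
    "^": Marking(None, "^", "^"),                    # beginning of sentence
    "~": Marking("I", "~", "~"),                     # included sentence
    "*<": Marking("H", "^ on a separate line", "^"),  # beginning of heading
    "_": Marking(None, "^", "^"),                    # beginning of list
    "|": Marking("P", "indentation", "3 spaces"),    # beginning of paragraph
}

print(MARKING["*<"].vertical_flag)
```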

5.3 Contractions

Contracted words are split up in the tagged corpus (indicated under 'special information' in the vertical version of the corpus; cf 2.6). See further the end of 7.24 (under 'apostrophe').

5.4 Codes for abbreviations and 'non-English' words

In the original corpus an abbreviation is marked in one of two ways: it is either preceded by \0, as in \0Mr; or it is enclosed in curly brackets, as in {0U.S.A.}. In the tagged corpus all abbreviations are prefixed by \0. The similar 'non-English' codes (cf Johansson et al 1978:28-30) are stripped away. Markers are inserted under 'special information' in the vertical version; see 2.6. In the horizontal version and the concordance there are no 'non-English' codes.

For information on the tagging of abbreviations and 'non-English' words, see 7.19-21.
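The unification of the two abbreviation markings can be sketched as follows in Python. The sketch assumes that the abbreviation code is a backslash followed by the digit zero, that a bracketed abbreviation is a single token, and that the deletion of word-final abbreviation points described in 5.2 applies here while internal points are kept. The function name normalise_abbreviation is hypothetical.

```python
import re

# Curly-bracket form of the original corpus, e.g. {0U.S.A.}
BRACKETED = re.compile(r"\{0([^}]*)\}")


def normalise_abbreviation(token: str) -> str:
    """Rewrite a single token so that an abbreviation carries the \\0 prefix."""
    # {0U.S.A.} becomes \0U.S.A. (a function is used as replacement to
    # avoid backslash escapes in the replacement string).
    token = BRACKETED.sub(lambda m: "\\0" + m.group(1), token)
    # Drop a word-final abbreviation point (assumed from 5.2).
    if token.startswith("\\0") and token.endswith("."):
        token = token[:-1]
    return token


print(normalise_abbreviation("{0U.S.A.}"))
print(normalise_abbreviation("\\0Mr."))
```

Tokens without an abbreviation code pass through unchanged, so a sentence-final full stop on an ordinary word is not affected.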

5.5 Other differences

The following features have been deleted in the tagged versions of the corpus:

Markers of typeshifts (italics, boldface, etc): *0, *1, *2, *3, *4, *5, *6, *7, *8, *9
The special marking of roman numerals: *=, **=
Comment tags enclosed within **[ **], except those marking quotations (see 5.2) and formulas (see 7.22)
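A user who wants to bring a passage of the original corpus closer to the tagged text could remove these features with a few regular expressions. The Python sketch below is an approximation under stated assumptions: comment tags to be kept are recognised by the words QUOTE or FORMULA inside the tag (the exact wording of formula tags is an assumption), and whitespace left behind by a deleted tag is not tidied up. The function name strip_codes is hypothetical.

```python
import re

COMMENT = re.compile(r"\*\*\[(.*?)\*\*\]")  # comment tags **[ ... **]
ROMAN = re.compile(r"\*\*?=")               # roman-numeral markers *= and **=
TYPESHIFT = re.compile(r"\*[0-9]")          # typeshift markers *0 to *9

# Assumed keywords identifying the comment tags that are retained.
KEEP_KEYWORDS = ("QUOTE", "FORMULA")


def strip_codes(text: str) -> str:
    """Delete typeshift, roman-numeral and most comment codes from text."""
    def keep_or_drop(match):
        inner = match.group(1)
        return match.group(0) if any(k in inner for k in KEEP_KEYWORDS) else ""

    text = COMMENT.sub(keep_or_drop, text)  # comment tags first
    text = ROMAN.sub("", text)              # matches **= as well as *=
    text = TYPESHIFT.sub("", text)
    return text


print(strip_codes("*1look*0 **[MISPRINT**]back"))
```

Comment tags are handled first so that the later patterns never see a tag that is about to be deleted anyway.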