A framework for processing and presenting parallel text corpora

Appendix A · Constants
Predefined character blocks in Unicode 3.0
LowSurrogates                 PrivateUse
CJKCompatibilityIdeographs    AlphabeticPresentationForms
ArabicPresentationForms-A     CombiningHalfMarks
CJKCompatibilityForms         SmallFormVariants
ArabicPresentationForms-B     Specials
HalfwidthandFullwidthForms    Specials
Table A.1: LanguageExplorer supports the character block names defined in Unicode 3.0 when constructing certain regular expressions (see section 5.4.6 on page 130). Note that these names omit the spaces that the Unicode standard uses as word separators (e.g. the block that the standard calls “Basic Latin” is written “BasicLatin”).
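As an illustration of how such space-free block names can be used in a regular expression, the following is a minimal sketch based on the standard java.util.regex package, which accepts the same names behind an `In` prefix (e.g. `\p{InBasicLatin}`). It is an assumption that this mirrors the syntax described in section 5.4.6; the class name `UnicodeBlockDemo` and the helpers `firstRunInBlock` and `isInBlock` are hypothetical and exist only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of LanguageExplorer.
final class UnicodeBlockDemo {

    // Returns the first run of characters in `text` that belong to the
    // named Unicode block, or "" if there is none. `blockName` is the
    // space-free form from Table A.1, e.g. "BasicLatin" or "Cyrillic".
    static String firstRunInBlock(String blockName, String text) {
        Pattern p = Pattern.compile("\\p{In" + blockName + "}+");
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : "";
    }

    // Checks whether a single character lies in the named block by
    // comparing its block with the one looked up from the space-free name.
    static boolean isInBlock(char c, String blockName) {
        return Character.UnicodeBlock.of(c)
                == Character.UnicodeBlock.forName(blockName);
    }

    public static void main(String[] args) {
        // "Moskva" in Cyrillic letters (written with escapes to stay
        // independent of the source file encoding), then " Moscow".
        String text = "\u041C\u043E\u0441\u043A\u0432\u0430 Moscow";

        // Prints the Cyrillic word only.
        System.out.println(firstRunInBlock("Cyrillic", text));

        // Prints " Moscow": the space character U+0020 is itself part
        // of the Basic Latin block, so it is included in the run.
        System.out.println(firstRunInBlock("BasicLatin", text));

        // U+FF21 is FULLWIDTH LATIN CAPITAL LETTER A; prints true.
        System.out.println(isInBlock('\uFF21', "HalfwidthandFullwidthForms"));
    }
}
```

Building the pattern from the block name keeps the example close to the table: any name listed there can be passed in unchanged, and an unknown name makes `Pattern.compile` fail early with a `PatternSyntaxException`.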
The character categories defined in Unicode 3.0
Category    Explanation

Characters
L           Letter.
Lu          Uppercase letter.
Ll          Lowercase letter.
Lt          Title case letter.
Lm          Modifier letter.
Lo          Any other letter.

Numbers
N           Number.
Nd          Decimal digit.
Nl          Letter number.
No          Any other number.

Symbols
S           A symbol.
Sm          A mathematical symbol.
Sc          A currency symbol.
Sk          A modifier symbol.
So          Any other symbol.

Punctuation marks
P           A punctuation mark.
Pc          A connector.
Pd          A dash.
Ps          An opening punctuation mark.
Pe          A closing punctuation mark.
Pi          An initial quote.
Pf          A final quote.
Po          Any other punctuation mark.

Separators
Z           A separator.
Zs          A space separator.
Zl          A line separator.
Zp          A paragraph separator.

Combining marks
M           A combining mark.
Mn          A nonspacing mark.
Mc          A spacing combining mark.
Me          An enclosing mark.

Other characters
C           Any other characters.

... to be continued on the next page ➥
Dissertation, Faculty of Information and Cognitive Sciences, University of Tübingen, 2004