7.2. CONSTRUCTION
units instead of constructing one automaton per phrase. If you have not applied the dictionaries, the phrase automaton that you obtain will consist of a single path made up only of
unknown words.
7.2.1 Rules of construction of text automata
The phrase automata are constructed from the text dictionaries. The degree of ambiguity obtained is therefore directly linked to the granularity of the descriptions in the
dictionaries used. From the phrase automaton in figure 7.3, you can conclude that the word
which has been coded twice as a determiner, in two subcategories of the category DET. This
granularity of descriptions will not be of any use if you are only interested in the grammatical
category of this word. It is therefore necessary to adapt the granularity of the dictionaries to
the intended use.
Figure 7.3: Double entry for which as a determiner
For each lexical unit of the phrase, Unitex searches for all possible interpretations in the
dictionary of the simple words of the text. It then looks for all sequences of lexical units that have an interpretation in the dictionary of the composite words of the text. All the combinations
of these interpretations constitute the phrase automaton.
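
The construction described above can be sketched as follows. This is only an illustrative sketch, not Unitex's actual implementation: the function names, the dictionary layout (a form mapped to a list of lemma and code pairs), the "UNKNOWN" code, the lowercase lookup and the toy entries are all assumptions made for the example.

```python
def build_phrase_automaton(tokens, simple_dict, compound_dict):
    """Return transitions (start, end, (form, lemma, code)).

    States are positions between lexical units: 0 .. len(tokens).
    Both dictionaries are hypothetical: {lowercase form: [(lemma, code), ...]}.
    """
    transitions = []
    # Step 1: every interpretation of each simple word becomes one transition.
    for i, tok in enumerate(tokens):
        entries = simple_dict.get(tok.lower(), [])  # lowercase lookup is an assumption
        if not entries:
            # No interpretation found: keep the unit as an unknown word.
            transitions.append((i, i + 1, (tok, None, "UNKNOWN")))
        for lemma, code in entries:
            # The form keeps the casing found in the text.
            transitions.append((i, i + 1, (tok, lemma, code)))
    # Step 2: every sequence of units found in the composite-word dictionary
    # adds a transition that spans the whole sequence.
    for i in range(len(tokens)):
        for j in range(i + 2, len(tokens) + 1):
            form = " ".join(tokens[i:j])  # spaces are kept inside composite words
            for lemma, code in compound_dict.get(form.lower(), []):
                transitions.append((i, j, (form, lemma, code)))
    return transitions


def count_paths(transitions, n_tokens):
    """Count the paths from state 0 to the final state (the analyses of the phrase)."""
    ways = [0] * (n_tokens + 1)
    ways[0] = 1
    # Transitions only go forward, so processing them by start state is enough.
    for start, end, _label in sorted(transitions, key=lambda t: t[0]):
        ways[end] += ways[start]
    return ways[n_tokens]


# Toy data, invented for this example.
tokens = ["Here", "is", "a", "hard", "disk"]
simple = {
    "here": [("here", "ADV")],
    "is": [("be", "V")],
    "a": [("a", "DET")],
    "hard": [("hard", "A"), ("hard", "ADV")],
    "disk": [("disk", "N")],
}
compound = {"hard disk": [("hard disk", "N")]}

automaton = build_phrase_automaton(tokens, simple, compound)
print(count_paths(automaton, len(tokens)))  # 3
```

With empty dictionaries, the same function yields a single path made up only of unknown words, which matches the behaviour described for a text on which no dictionaries have been applied.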
NOTE: if the text contains lexical labels (e.g. {aujourd’hui,.ADV}), these labels are
reproduced identically in the automaton without trying to decompose the sequences which
they represent.
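
As an illustration of this note, the sketch below keeps a lexical label as a single unit and reads its fields. The regular expressions and function names are assumptions made for the example, and escaped characters inside labels are not handled. An empty canonical form is taken here to mean "same as the inflected form".

```python
import re

# A lexical label: {inflected form,canonical form.codes}
LABEL = re.compile(r"\{([^{},]+),([^{}.]*)\.([^{}]+)\}")
# A unit is a whole label, a run of word characters, or a single other symbol.
UNIT = re.compile(r"\{[^{}]*\}|\w+|[^\w\s]")


def split_units(text):
    """Split text into lexical units; a {...} label is never decomposed."""
    return UNIT.findall(text)


def parse_label(unit):
    """Return (form, lemma, codes) for a lexical label, or None otherwise."""
    match = LABEL.fullmatch(unit)
    if match is None:
        return None
    form, lemma, codes = match.groups()
    # Assumption: an empty canonical form stands for the inflected form itself.
    return (form, lemma or form, codes)


units = split_units("Il vient {aujourd’hui,.ADV}.")
print(units)
print(parse_label(units[2]))
```

Outside a label, this simplified splitter would cut aujourd’hui at the apostrophe; inside the braces the sequence is kept whole, which is the point of the note.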
In each box, the first line contains the inflected form found in the text, and the second line
contains the canonical form if it is different. The other information is coded below the box
(cf. section 7.3.1).
The spaces that separate the lexical units are not copied into the automaton, except for the
spaces inside composite words.
The case of lexical units is preserved. For example, if the word Here is encountered,
the capital letter is preserved (cf. figure 7.1). This choice makes it possible to keep this information