Download Introduction to Computational Linguistics - Uni
Transcript
10. Characters, Strings and Regular Expressions
Definition 2 Let ~u be a string.
|~u| = m and |~v| = n. Then ~ua~v
follows.
~u( j)
(70)
(~ua~v)( j) =
~v( j − m)
27
The length of ~u is denoted by |~u|. Suppose that
is a string of length m + n, which is defined as
if j < m,
else.
~u is a prefix of ~v if there is a w
~ such that ~v = ~ua w
~ , and a postfix if there is a w
~ such
a
~ ~u. ~u is a substring if there are w
~ and ~x such that ~v = w
~ a~va ~x.
that ~v = w
OCaML has a function String.length that returns the length of a given string.
For example, String.length "cat" will give 3. Notice that by our conventions,
you cannot access the symbol with number 3. Look at the following dialog.
(71)
# "cat".[2];;
- : char = ’t’
# "cat".[String.length "cat"];;
Exception: Invalid_argument "String.get".
The last symbol of the string has the number 2, but the string has length 3. If
you try to access an element that does not exist, OCaML raises the exception
Invalid_argument "String.get".
In OCaML, ^ is an infix operator for string concatenation. So, if we write
"tom"^"cat" OCaML returns "tomcat". Here is a useful abbreviation. ~xn denotes the string obtained by repeating ~x n–times. This can be defined recursively
as follows.
(72)
(73)
~x0 = ε
~xn+1 = ~xna ~x
So, vux3 = vuxvuxvux.
A language over A is a set of strings over A. Here are a few useful operations
on languages.
(74a)
(74b)
(74c)
L · M := {~x~y : ~x ∈ L, ~y ∈ M}
L/M := {~x : exists ~y ∈ M: ~x~y ∈ L}
M\L := {~x : exists ~y ∈ M: ~y~x ∈ L}