CHAPTER 6. EXTRACTING DATA FROM FOREIGN WEB SITES
1. Checking form data by ensuring that data entered in forms follow the expected syntax. If a number is expected in an HTML form, the server program
must check that it is actually a number that has been entered. This particular use of regular expressions is covered in Chapter 8. Regular expressions
can only check syntax; that is, given a date, a regular expression cannot
easily be used to check the validity of the date (e.g., that the date is not
February 30). However, a regular expression may be used to check that the
date has the ISO format YYYY-MM-DD.
2. Extracting data from foreign Web sites, as in the Currency Service above.
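The distinction drawn in item 1 between syntax and validity can be sketched as follows. This is a minimal illustration in Python's `re` module, chosen here only as an assumption for the sake of a runnable example; it is not the SMLserver interface. The names `ISO_DATE`, `has_iso_syntax`, and `is_valid_date` are hypothetical.

```python
import re
from datetime import date

# Syntax only: four digits, a dash, two digits, a dash, two digits.
ISO_DATE = re.compile(r"[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]")


def has_iso_syntax(text: str) -> bool:
    """True if the whole of text has the shape YYYY-MM-DD."""
    return ISO_DATE.fullmatch(text) is not None


def is_valid_date(text: str) -> bool:
    """True if text has ISO syntax and also names a real calendar date."""
    if not has_iso_syntax(text):
        return False
    year, month, day = (int(part) for part in text.split("-"))
    try:
        date(year, month, day)  # raises ValueError for dates such as February 30
    except ValueError:
        return False
    return True


# The pattern accepts February 30 because it only looks at the shape.
assert has_iso_syntax("2023-02-30")
# The calendar check, which is ordinary program logic, rejects it.
assert not is_valid_date("2023-02-30")
```

The pattern alone is enough to reject malformed input such as 2023-2-3, while the validity check needs code beyond the regular expression.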
In the following we shall often use the term “pattern” instead of the longer
“regular expression”. The syntax of regular expressions is defined according to the
description in Figure 6.2.
A character class is a set of ASCII characters defined according to Figure 6.3.
The use of regular expressions is best illustrated with a series of examples:
• [A-Za-z] : matches any single letter of the English alphabet, in upper or lower case.
• [0-9][0-9] : matches numbers containing two digits, where both digits may
be zero.
• (cow|pig)s? : matches the four strings cow, cows, pig, and pigs.
• ((a|b)a)* : matches the empty string, aa, ba, aaaa, baaa, ...
• (0|1)+ : matches the binary numbers (e.g., 0, 1, 01, 11, 011101010, ...).
• .. : matches two arbitrary characters.
• ([1-9][0-9]*)/([1-9][0-9]*) : matches positive fractions of whole numbers
(e.g., 1/8, 32/5645, and 45/6). Notice that the pattern matches neither
the fraction 012/54 nor 1/0, because a number may not begin with a zero.
• <html>.*</html> : matches HTML pages (and text that is not HTML).
• www\.(((it-c|itu)\.dk)|(it\.edu)) : matches the three Web addresses
www.itu.dk, www.it-c.dk, and www.it.edu.
• http://hug.it.edu:8034/ps2/(.*)\.sml : matches all URLs denoting .sml
files on the machine hug.it.edu in directory ps2 for the service that runs
on port number 8034.
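Several of the examples above can be checked mechanically. The sketch below again uses Python's `re` module as a stand-in, under the assumption that these particular patterns mean the same there as in the syntax of Figure 6.2; it is not the SMLserver interface. The helper names `matches` and `sml_file_name` and the file name `currency.sml` are hypothetical. The fraction pattern is written with `*` so that single-digit numbers such as those in 1/8 are accepted.

```python
import re
from typing import Optional


def matches(pattern: str, text: str) -> bool:
    """True if the pattern matches the whole of text, not just a part of it."""
    return re.fullmatch(pattern, text) is not None


# (cow|pig)s? accepts exactly four strings.
assert all(matches(r"(cow|pig)s?", s) for s in ["cow", "cows", "pig", "pigs"])
assert not matches(r"(cow|pig)s?", "cowss")

# ((a|b)a)* accepts any number of repetitions, including none at all.
assert matches(r"((a|b)a)*", "baaa")
assert matches(r"((a|b)a)*", "")
assert not matches(r"((a|b)a)*", "ab")

# Positive fractions: neither number may begin with a zero.
FRACTION = r"([1-9][0-9]*)/([1-9][0-9]*)"
assert matches(FRACTION, "1/8")
assert not matches(FRACTION, "012/54")
assert not matches(FRACTION, "1/0")

# The escaped dots match only a literal ".".
HOST = r"www\.(((it-c|itu)\.dk)|(it\.edu))"
assert all(matches(HOST, h) for h in ["www.itu.dk", "www.it-c.dk", "www.it.edu"])
assert not matches(HOST, "www.itu.edu")

# The URL pattern is copied verbatim from the last example; its
# parenthesized group (.*) captures the file name without the extension.
SML_URL = r"http://hug.it.edu:8034/ps2/(.*)\.sml"


def sml_file_name(url: str) -> Optional[str]:
    """Return the part of url captured by (.*), or None if url does not match."""
    found = re.fullmatch(SML_URL, url)
    return found.group(1) if found else None


assert sml_file_name("http://hug.it.edu:8034/ps2/currency.sml") == "currency"
assert sml_file_name("http://hug.it.edu:8034/ps2/index.html") is None
```

Parenthesized groups serve two purposes here: they delimit alternatives, as in (cow|pig), and they mark the part of a match that a program wants to extract, as in the URL pattern.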
In the next section, we see how regular expressions may be used with
SMLserver.