CHAPTER 6. EXTRACTING DATA FROM FOREIGN WEB SITES

1. Checking form data by ensuring that data entered in forms follows the expected syntax. If a number is expected in an HTML form, the server program must check that what has been entered is actually a number. This particular use of regular expressions is covered in Chapter 8. Regular expressions can only check syntax; that is, given a date, a regular expression cannot easily be used to check the validity of the date (e.g., that the date is not February 30). However, a regular expression may be used to check that the date has the ISO format YYYY-MM-DD.

2. Extracting data from foreign Web sites, as in the Currency Service above.

In the following we shall often use the term “pattern” instead of the longer “regular expression”.

The syntax of regular expressions is defined according to the description in Figure 6.2. A character class, written class, is a set of ASCII characters defined according to Figure 6.3.

Potential uses of regular expressions are best illustrated with a series of examples:

• [A-Za-z] : matches any single letter of the English alphabet.
• [0-9][0-9] : matches numbers containing two digits, where both digits may be zero.
• (cow|pig)s? : matches the four strings cow, cows, pig, and pigs.
• ((a|b)a)* : matches the empty string, aa, ba, aaaa, baaa, ...
• (0|1)+ : matches the binary numbers (e.g., 0, 1, 01, 11, 011101010, ...).
• .. : matches two arbitrary characters.
• ([1-9][0-9]*)/([1-9][0-9]*) : matches positive fractions of whole numbers (e.g., 1/8, 32/5645, and 45/6). Notice that the pattern matches neither the fraction 012/54 nor 1/0.
• <html>.*</html> : matches HTML pages (and text that is not HTML).
• www\.(((it-c|itu)\.dk)|(it\.edu)) : matches the three Web addresses www.itu.dk, www.it-c.dk, and www.it.edu.
• http://hug.it.edu:8034/ps2/(.*)\.sml : matches all URLs denoting .sml files on the machine hug.it.edu in directory ps2 for the service that runs on port number 8034.
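Some of the example patterns can be tried out directly. The following is a minimal sketch in Python rather than with SMLserver; it assumes that Python's `re` module accepts these patterns unchanged and that "matches" means the whole string matches (`re.fullmatch`). The names `PATTERNS`, `matches`, and `is_valid_iso_date` are hypothetical helpers introduced only for illustration, and the ISO date pattern is our own rendering of the YYYY-MM-DD format. The fraction pattern is written with `*` so that one-digit numbers such as those in 1/8 are accepted. The last function illustrates the point that a pattern checks only syntax: the calendar check is done separately.

```python
import re
from datetime import date

# Patterns taken from the examples above. Using Python's `re` syntax here
# is an assumption of this sketch; it is not the SMLserver interface.
PATTERNS = {
    "letter": r"[A-Za-z]",
    "two_digits": r"[0-9][0-9]",
    "animals": r"(cow|pig)s?",
    "binary": r"(0|1)+",
    "fraction": r"([1-9][0-9]*)/([1-9][0-9]*)",
    "host": r"www\.(((it-c|itu)\.dk)|(it\.edu))",
    # Our own rendering of the ISO format YYYY-MM-DD (syntax only).
    "iso_date": r"[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]",
}


def matches(name, text):
    """Return True if the whole of `text` matches the named pattern."""
    return re.fullmatch(PATTERNS[name], text) is not None


def is_valid_iso_date(text):
    """Check the syntax with the pattern, then the validity with the calendar."""
    if not matches("iso_date", text):
        return False
    year, month, day = (int(part) for part in text.split("-"))
    try:
        date(year, month, day)  # raises ValueError for dates such as February 30
    except ValueError:
        return False
    return True
```

For instance, `matches("iso_date", "2024-02-30")` succeeds because the string has the right shape, whereas `is_valid_iso_date("2024-02-30")` fails because no such day exists.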
In the next section, we see how regular expressions may be used with SMLserver.