4.4 The Experimental Version
73  ## download and analyze pictures
74  COUNTER=0
75  for LINE in `cat pictures.txt`; do
76      PICTURE=`echo $LINE | awk -F',' '{print $1}'`
77      PAGE=`echo $LINE | awk -F',' '{print $2}'`
78      COUNTER=`echo $COUNTER+1 | bc`
79      SUFFIX=img_`echo 000$COUNTER | sed -e 's/^.*\([0-9]\{4\}\)$/\1/'`
80      echo "downloading picture $SUFFIX"
81      java getPicture $PICTURE $PAGE $SUFFIX dbentry.txt
82      PICTURE=`awk '/^local:/{ print $2 }' $SUFFIX'.txt'`
83      bit -v -h -c -p -d -s -f $PICTURE | tail +3 | strings >> $SUFFIX.txt
84  done
All pictures listed in the file pictures.txt are downloaded sequentially. The format of the file pictures.txt is:
<picture-URL>,<page-URL>
<picture-URL>,<page-URL>
<picture-URL>,<page-URL>
...
From each line, the URL of a picture and the URL of the page on which this picture was
found are stored in the variables PICTURE and PAGE, respectively. A new basename for
the picture is generated from COUNTER and stored in SUFFIX. The new basenames
of the pictures are img_0001, img_0002, etc.
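The splitting and numbering described above can also be sketched without awk and sed. The following is only an illustration, assuming a POSIX shell (the parameter expansions and $( ) are not available in the original Bourne shell); the function names and the sample URLs are made up for this example.

```shell
#!/bin/sh
# Sketch: split one "<picture-URL>,<page-URL>" line and build a
# zero-padded basename. Function names and URLs are illustrative only.

# Everything before the first comma is the picture URL.
picture_of() { printf '%s\n' "${1%%,*}"; }

# Everything after the first comma is the page URL.
page_of() { printf '%s\n' "${1#*,}"; }

# Pad the counter to four digits, e.g. 7 becomes img_0007.
suffix_of() { printf 'img_%04d\n' "$1"; }

LINE='http://example.org/a.gif,http://example.org/index.html'
echo "picture: $(picture_of "$LINE")"
echo "page:    $(page_of "$LINE")"
echo "suffix:  $(suffix_of 7)"
```

One difference to the sed expression in the listing: sed keeps only the last four digits, whereas printf would produce a five-digit number once the counter exceeds 9999.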
The pipe to the UNIX command bc in line 78 is necessary because the Bourne shell does not
provide any built-in arithmetic operations.
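For comparison, the counter can be incremented in two other ways. This is a minimal sketch, not part of the original script: expr is an external command like bc, while $(( )) assumes a POSIX-compatible shell rather than the classic Bourne shell. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch: two alternatives to "echo $COUNTER+1 | bc".

# expr is an external command, so it also works with a classic Bourne shell.
increment_expr() { expr "$1" + 1; }

# $(( )) is POSIX arithmetic expansion; assumes the shell supports it.
increment_posix() { echo $(( $1 + 1 )); }

COUNTER=0
COUNTER=$(increment_expr "$COUNTER")
COUNTER=$(increment_posix "$COUNTER")
echo "COUNTER is now $COUNTER"
```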
In line 81, the Java application getPicture is called with PICTURE, PAGE, SUFFIX, and
the name of the file containing the database entry. getPicture downloads the picture to
<basename>.<extension>, using the new basename and the extension .gif, .jpg, or .png,
depending on the format of the picture. Since the picture is downloaded completely anyway, the former restriction to the first 400 bytes for analyzing the picture no longer applies.
This is especially useful for the older JPEG versions 1.00 and 1.01, which usually could not
be analyzed from the beginning of the file alone and therefore sometimes caused bad results.
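How getPicture determines the picture format is not shown here. One common approach, sketched below purely as an assumption, is to inspect the leading signature bytes of the downloaded file. detect_extension is a hypothetical helper and not part of getPicture.

```shell
#!/bin/sh
# Sketch: guess a file extension from the first bytes of an image file.
# detect_extension is a hypothetical helper, not part of getPicture.

detect_extension() {
    # Render the first four bytes as lowercase hex without separators.
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    case "$magic" in
        47494638) echo gif ;;      # "GIF8", shared by GIF87a and GIF89a
        89504e47) echo png ;;      # byte 0x89 followed by "PNG"
        ffd8*)    echo jpg ;;      # JPEG start-of-image marker
        *)        echo unknown ;;
    esac
}

TMP=$(mktemp)
printf 'GIF89a' > "$TMP"
echo "detected: $(detect_extension "$TMP")"
rm -f "$TMP"
```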
Besides downloading the picture, getPicture creates a file <basename>.txt containing information about the picture, i.e. its parameters and information from the database. In line
82, the variable PICTURE is set to the name of the picture on the local filesystem. This
information is extracted from <basename>.txt using the awk tool (footnote 6). In line 83, bit [Haa99]
is called with PICTURE as parameter, and the resulting information is appended to the file
<basename>.txt.
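The awk extraction of the local filename can be tried in isolation. In the following sketch, only the existence of a line starting with "local:" is implied by the script; the other line of the sample file and the helper name local_name are made up for illustration.

```shell
#!/bin/sh
# Sketch: read the local filename back out of a <basename>.txt info file.
# Only the "local:" line is implied by the script; the "url:" line
# in this sample file is invented for illustration.

INFO=$(mktemp)
cat > "$INFO" <<'EOF'
url: http://example.org/a.gif
local: img_0001.gif
EOF

# Print the second whitespace-separated field of the line starting with "local:".
local_name() { awk '/^local:/ { print $2 }' "$1"; }

PICTURE=$(local_name "$INFO")
echo "local file: $PICTURE"
rm -f "$INFO"
```

Since awk splits fields on whitespace by default, this only works as long as the local filename itself contains no spaces, which holds for the generated img_ basenames.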
86  ## set field separator to previous value
87  IFS=$OLD_IFS
88
89  ## optional: delete pictures not to be shown

6  awk is an acronym formed from the surnames of its authors: Aho, Weinberger, and Kernighan.