Adaptive Revisiting with Heritrix

considered a part of the content. The classic example would be a web
page, possibly the front page of a news service that prints the date and
time on each visit. This value will change with every visit, but does not
constitute a change in the content. Documents like this could cause the
adaptive strategy to severely over-visit them, wasting bandwidth and
storage as well as placing an unnecessary strain on the web servers in
question. It would also cause additional effort later when trying to analyze
the contents of the archive, since it would contain duplicates that could not
be automatically detected and would need to be sorted manually.
Unfortunately, while it is quite simple and straightforward to compare
two documents on a bit-by-bit level, trying to compare the 'content' is
much more complicated. It requires some understanding of what
constitutes content and what can be ignored. One option, in dealing with
HTML files, is to ignore the markup code since, typically, it only contains
layout information. Yet even this falls far short of our needs, since we
have already pointed out an example where irrelevant information would
be a part of an HTML document's 'content'. Therefore the distinction drawn between
content and presentation in HTML is of no use to us.
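
The gap between the two comparisons can be illustrated with a minimal
Python sketch. It is not taken from Heritrix; the function names, the
choice of SHA-1 and the sample pages are all assumptions made for
illustration. It computes one digest over the raw bytes and one over the
text with the markup stripped.

```python
import hashlib
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document, dropping the markup."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strict_digest(document: bytes) -> str:
    """Digest of the raw bytes: any single-bit difference changes the result."""
    return hashlib.sha1(document).hexdigest()


def text_digest(html: str) -> str:
    """Digest of the text with markup removed and whitespace normalized."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(" ".join(extractor.parts).split())
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


# Hypothetical pages: two visits that differ only in a printed time, and
# one copy of the first visit that differs only in its markup.
visit_1 = "<html><body><p>Fetched 10:00</p><h1>Headline</h1></body></html>"
visit_2 = "<html><body><p>Fetched 10:05</p><h1>Headline</h1></body></html>"
restyled = "<html><body><p>Fetched 10:00</p><h2>Headline</h2></body></html>"

# Markup-only changes are ignored by the text digest ...
markup_ignored = text_digest(visit_1) == text_digest(restyled)
# ... but the printed time is ordinary text, so it still registers as a change.
time_detected = text_digest(visit_1) != text_digest(visit_2)
```

In this sketch both `markup_ignored` and `time_detected` come out true:
stripping the markup hides the restyling, but the printed time is still
reported as a change.
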
Some alternatives do present themselves. We know that, primarily, the
problem relates to web servers inserting dynamically generated sections
unrelated to the main text of the documents. Therefore, the sections are
likely to be a relatively small part of the document. In such an instance, a
'roughly the same' or 'close enough' comparison might suffice. That is to
say, instead of comparing two documents and declaring them either
identical or different, we measure the degree of difference between them.
Documents that are sufficiently similar are treated as being unchanged.
This means minor changes will be ignored, whereas changes to the
general body of a document should be detected.
A good deal of work has been done on the subject, including methods
such as shingling [6], which provide fairly good results.
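
A 'close enough' comparison of this kind can be sketched in a few lines
of Python. This is an illustration only, not the exact method of [6] and
not Heritrix code: it compares full sets of word shingles using Jaccard
resemblance, whereas practical schemes usually hash and sample the
shingles. The names `shingles`, `resemblance` and `is_unchanged`, the
shingle width of 4 and the threshold of 0.8 are all assumptions.

```python
def shingles(text: str, width: int = 4) -> set:
    """Return the set of all runs of `width` consecutive words in `text`."""
    words = text.lower().split()
    if len(words) < width:
        # Very short texts yield a single shingle (or none, if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + width]) for i in range(len(words) - width + 1)}


def resemblance(a: str, b: str, width: int = 4) -> float:
    """Jaccard resemblance of the shingle sets: 1.0 identical, 0.0 disjoint."""
    set_a, set_b = shingles(a, width), shingles(b, width)
    if not set_a and not set_b:
        return 1.0  # two empty documents are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def is_unchanged(a: str, b: str, threshold: float = 0.8, width: int = 4) -> bool:
    """Treat two versions as unchanged when they are sufficiently similar."""
    return resemblance(a, b, width) >= threshold


# Hypothetical example: the same 50-word body under two different timestamps,
# and a third version whose body has been replaced entirely.
body = " ".join(f"word{i}" for i in range(50))
old_version = "Updated 10:00 " + body
new_version = "Updated 10:05 " + body
rewritten = "Updated 10:00 " + " ".join(f"other{i}" for i in range(50))

timestamp_only = is_unchanged(old_version, new_version)
body_replaced = is_unchanged(old_version, rewritten)
```

Here the timestamp-only change scores 47/51 (about 0.92) and is treated
as unchanged, while the replaced body scores 0.0 and is detected. Note
that any other edit confined to a few words, wherever it occurs in the
document, would also score above the threshold.
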
Unfortunately, not all minor changes are unimportant. Consider for
example the press release archive of a company, government agency or
other similar entity. Let's suppose that several months after issuing a press
release it has become clear that the policy set forth is no longer entirely
valid, and might in fact show the issuing party in a bad light. The temptation
to 'tone down' the headline to lessen the impact might well be too much