Adaptive Revisiting with Heritrix
considered a part of the content. The classic example would be a web page, possibly the front page of a news service, that prints the date and time on each visit. This value will change with every visit, but does not constitute a change in the content. Documents like this could cause the adaptive strategy to severely over-visit them, wasting bandwidth and storage as well as placing an unnecessary strain on the web servers in question. It would also cause additional effort later when trying to analyze the contents of the archive, since the archive would contain duplicates that could not be automatically detected and would need to be manually sorted.
Unfortunately, while it is quite simple and straightforward to compare two documents bit by bit, comparing their 'content' is much more complicated. It requires some understanding of what constitutes content and what can be ignored. One option, in dealing with HTML files, is to ignore the markup code since, typically, it only contains layout information. Yet even this falls far short of our needs, since we have already pointed out an example where irrelevant information would be a part of an HTML document's 'content'. Therefore the distinction drawn between content and presentation in HTML is of no use to us.
Some alternatives do present themselves. We know that, primarily, the problem relates to web servers inserting dynamically generated sections unrelated to the main text of the documents. Such sections are therefore likely to be a relatively small part of the document. In such an instance, a 'roughly the same' or 'close enough' comparison might suffice. That is to say, instead of comparing two documents and declaring them either identical or different, we measure the degree of difference between them. Documents that are sufficiently similar are treated as being unchanged. This means minor changes will be ignored, whereas changes to the general body of a document should be detected.
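The contrast between a strict bit-level comparison and a 'close enough' comparison can be illustrated with a minimal sketch. It is not Heritrix code: the function names, the shingle width of 4 words, and the 0.9 threshold are all assumptions chosen for illustration, and word-level shingles with Jaccard resemblance are used here as one common way to score similarity.

```python
import hashlib
import re


def digest(text: str) -> str:
    """Strict comparison: any single changed byte yields a different digest."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def shingles(text: str, w: int = 4) -> set:
    """Return the set of all runs of w consecutive words (w is an assumed width)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < w:
        # Short documents are represented by a single shingle of all their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a: str, b: str, w: int = 4) -> float:
    """Jaccard resemblance of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def is_unchanged(old: str, new: str, threshold: float = 0.9) -> bool:
    """Treat sufficiently similar documents as unchanged (threshold is assumed)."""
    return resemblance(old, new) >= threshold
```

With a page whose only difference is an embedded timestamp, `digest` reports a change while `is_unchanged` returns `True`, provided the timestamp is a small fraction of the text. A check of this kind ignores every small edit, whatever its cause, so the threshold is a trade-off rather than a safe default.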
A good deal of work has been done on the subject, including methods such as shingling[6] that provide good results. Unfortunately, not all minor changes are unimportant. Consider for example the press release archive of a company, government agency or other similar entity. Let's suppose that several months after issuing a press release it has become clear that the policy set forth is no longer entirely valid, and might in fact show the issuing party in a bad light. The temptation to 'tone down' the headline to lessen the impact might well be too much