Adaptive Revisiting with Heritrix
By using the Berkeley DB, the concerns of writing to and reading from disk, including caching and related issues, were removed from the Frontier and delegated to a third-party tool written expressly for managing large amounts of data, which further improved performance. We will discuss the Berkeley DB in more detail later, in chapter 6.2.

Aside from the much improved handling of queues, the BdbFrontier is very much like the HostQueuesFrontier and implements essentially the same snapshot strategy. It does improve on several points, including the ability to assign a 'budget' to each queue. The cost of each URI is evaluated according to the selected policy, and once a queue’s budget is exhausted no further URIs are crawled from it. Also, if the 'hold queues' option is used, that is, focusing on a small number of queues at a time, the queues remain active for a fixed amount of 'cost,' after which they become inactive until it is their turn again.

The cost/budget addition to the Frontier enhances the possibilities in configuring broad crawls. As was discussed earlier, broad crawls typically trade off the depth to which they crawl each site in favor of crawling more sites.

3.5.3 AbstractFrontier

The AbstractFrontier was developed alongside the BdbFrontier and is meant to be a partial implementation of a generic Frontier. That is, it implements those parts of the Frontier that are largely independent of the crawling strategy. This includes, to varying degrees, management of numerous general-purpose settings, statistics, maintenance of a recovery log and more. It also provides a number of utility methods for evaluating URIs, and it handles URI canonicalization. The BdbFrontier subclasses the AbstractFrontier.
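
The division of labor between a partial base implementation and a strategy-specific subclass can be illustrated with a minimal Java sketch. All names here (SketchFrontier, FifoSketchFrontier, schedule, enqueue) are hypothetical and chosen for illustration; they are not the Heritrix API, and the canonicalization shown is deliberately simplistic.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical base class: does the routine work shared by all strategies.
// Names and behavior are assumptions for illustration, not Heritrix code.
abstract class SketchFrontier {
    private long scheduledCount = 0; // a routine statistic kept by the base class

    // Canonicalize, update statistics, then delegate to the strategy.
    public final void schedule(String uri) {
        String canonical = canonicalize(uri);
        scheduledCount++;
        enqueue(canonical);
    }

    public final long scheduledCount() {
        return scheduledCount;
    }

    // Toy canonicalization: trim whitespace and drop one trailing slash.
    protected String canonicalize(String uri) {
        String s = uri.trim();
        return s.endsWith("/") ? s.substring(0, s.length() - 1) : s;
    }

    // The strategy-specific parts are left to subclasses.
    protected abstract void enqueue(String canonicalUri);

    // Returns the next URI to crawl, or null when nothing is left.
    public abstract String next();
}

// A trivial strategy: first in, first out.
class FifoSketchFrontier extends SketchFrontier {
    private final Deque<String> queue = new ArrayDeque<>();

    @Override
    protected void enqueue(String canonicalUri) {
        queue.addLast(canonicalUri);
    }

    @Override
    public String next() {
        return queue.pollFirst();
    }
}
```

In this sketch a new strategy only has to decide how URIs are stored and in what order they are handed out; canonicalization and counting are inherited unchanged.
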
The AbstractFrontier is provided to simplify the creation of new Frontiers by handling the routine work, allowing new Frontiers to focus on their crawling strategies rather than on the mundane aspects of a Frontier. Furthermore, this can simplify future code maintenance, both if the Frontier API changes and if new or improved functionality is developed that should apply to most or all Frontiers.
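
The per-queue budget described above for the BdbFrontier can be sketched in the same hedged way. The class BudgetedQueue and the interface CostPolicy are hypothetical names, and the unit-cost policy in the usage note is only an assumption; the real cost policies and bookkeeping in Heritrix may differ.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical cost policy: assigns a cost to each URI.
interface CostPolicy {
    int costOf(String uri);
}

// Hypothetical queue that stops handing out URIs once its budget is spent.
class BudgetedQueue {
    private final Deque<String> uris = new ArrayDeque<>();
    private final CostPolicy policy;
    private final int totalBudget;
    private int spent = 0;

    BudgetedQueue(CostPolicy policy, int totalBudget) {
        this.policy = policy;
        this.totalBudget = totalBudget;
    }

    void add(String uri) {
        uris.addLast(uri);
    }

    boolean isExhausted() {
        return spent >= totalBudget;
    }

    int spent() {
        return spent;
    }

    // Returns the next URI, or null if the queue is empty or the budget
    // is exhausted. The last URI handed out may overshoot the budget
    // when costs are larger than one.
    String next() {
        if (isExhausted()) {
            return null;
        }
        String uri = uris.pollFirst();
        if (uri != null) {
            spent += policy.costOf(uri);
        }
        return uri;
    }
}
```

With a unit-cost policy such as `uri -> 1` and a budget of 2, a queue holding three URIs hands out the first two and then returns null, even though one URI remains queued.
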