Adaptive Revisiting with Heritrix

By using the Berkeley DB, the concerns of writing to and reading from
disk, including caching and related issues, were removed from the
Frontier and delegated to a third-party tool written expressly for
managing large amounts of data, which further improved performance.
We will discuss the Berkeley DB in more detail later, in chapter 6.2.
Aside from the much improved handling of queues, the BdbFrontier is
very much like the HostQueuesFrontier and implements essentially the
same snapshot strategy. It does improve on it in several respects,
including the ability to assign a 'budget' to each queue. The cost of
each URI is evaluated according to the selected cost policy, and once a
queue’s budget is exhausted no further URIs from that queue are crawled.
Also, when the 'hold queues' option is used, that is, when the crawl
focuses on a small number of queues at a time, each queue remains active
for a fixed amount of 'cost', after which it becomes inactive until its
turn comes again.
The cost/budget addition to the Frontier enhances the possibilities for
configuring broad crawls. As was discussed earlier, broad crawls typically
trade depth for breadth: they crawl each site less deeply in order to
cover more sites. A per-queue budget offers a direct way to limit the
effort spent on any one queue.
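The interplay of the total budget and the 'hold queues' rotation can be
illustrated with a minimal Java sketch. It is not the BdbFrontier's code:
the class names BudgetedQueue and HoldQueuesSketch, the methods on them,
and the idea of passing the cost policy as a function are all assumptions
made for illustration. Each queue has a total budget and a per-activation
session balance; both are charged with the cost of every URI handed out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.function.ToIntFunction;

/** Hypothetical queue with a total budget; not a Heritrix class. */
class BudgetedQueue {
    private final Deque<String> uris = new ArrayDeque<>();
    private final ToIntFunction<String> costPolicy; // assumed pluggable cost policy
    private int totalBudget;     // lifetime budget of this queue
    private int sessionBalance;  // cost left in the current active session

    BudgetedQueue(int totalBudget, ToIntFunction<String> costPolicy) {
        this.totalBudget = totalBudget;
        this.costPolicy = costPolicy;
    }

    void add(String uri) { uris.addLast(uri); }

    /** Called when it is this queue's turn to become active. */
    void activate(int sessionBudget) { sessionBalance = sessionBudget; }

    /** True once nothing more will ever be crawled from this queue. */
    boolean isRetired() { return uris.isEmpty() || totalBudget <= 0; }

    /** Next URI, or null if the queue is retired or its session is used up. */
    String next() {
        if (isRetired() || sessionBalance <= 0) return null;
        String uri = uris.removeFirst();
        int cost = costPolicy.applyAsInt(uri);
        totalBudget -= cost;
        sessionBalance -= cost;
        return uri;
    }
}

/** Hypothetical rotation that keeps one queue active at a time. */
class HoldQueuesSketch {
    static List<String> crawlOrder(List<BudgetedQueue> queues, int sessionBudget) {
        if (sessionBudget <= 0) {
            throw new IllegalArgumentException("sessionBudget must be positive");
        }
        Deque<BudgetedQueue> rotation = new ArrayDeque<>(queues);
        List<String> order = new ArrayList<>();
        while (!rotation.isEmpty()) {
            BudgetedQueue q = rotation.removeFirst();
            q.activate(sessionBudget);
            String uri;
            while ((uri = q.next()) != null) order.add(uri);
            // Session spent but work remains: go inactive, wait for the next turn.
            if (!q.isRetired()) rotation.addLast(q);
        }
        return order;
    }

    public static void main(String[] args) {
        BudgetedQueue a = new BudgetedQueue(3, uri -> 1);  // unit cost per URI
        for (String u : Arrays.asList("a1", "a2", "a3", "a4")) a.add(u);
        BudgetedQueue b = new BudgetedQueue(10, uri -> 1);
        for (String u : Arrays.asList("b1", "b2", "b3")) b.add(u);
        // Prints [a1, a2, b1, b2, a3, b3]; a4 is never crawled (budget of 3 spent).
        System.out.println(crawlOrder(Arrays.asList(a, b), 2));
    }
}
```

In this sketch a queue with a small total budget simply drops out of the
rotation, which is the behavior a broad crawl would rely on to avoid going
deep into any single queue. A real cost policy would presumably assign
different costs to different URIs rather than a constant.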
3.5.3 AbstractFrontier
The AbstractFrontier was developed alongside the BdbFrontier and is
meant to be a partial implementation of a generic Frontier. That is, it is
meant to implement those parts of the Frontier that are largely
independent of the crawling strategy being implemented. This includes, to
varying degrees, managing numerous general-purpose settings, gathering
statistics, maintaining a recovery log and more. It also provides a number
of utility methods for evaluating URIs and handles URI canonicalization.
The BdbFrontier subclasses the AbstractFrontier. The AbstractFrontier is
provided to simplify the creation of new Frontiers by doing all the routine
work in one place, allowing new Frontiers to focus on their crawling
strategies rather than having to tackle all the mundane aspects of a Frontier.
Furthermore, this can simplify code maintenance in the future, both if a
change to the Frontier API is introduced, and also if new or improved
functionality is developed that should be applied to most or all Frontiers.
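The division of labor described here resembles the template method
pattern, which the following Java sketch illustrates. It is a simplified
illustration under stated assumptions, not the AbstractFrontier API: the
names SketchFrontier, FifoSketchFrontier, enqueue and dequeue, the
recovery log line format, and the toy canonicalization rule (trim
whitespace, drop the fragment) are all invented for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

/** Hypothetical base class holding the strategy-independent routine work. */
abstract class SketchFrontier {
    private final List<String> recoveryLog = new ArrayList<>();
    private long scheduledCount;

    /** Shared path for every strategy: canonicalize, log, count, delegate. */
    final void schedule(String uri) {
        String canonical = canonicalize(uri);
        recoveryLog.add("SCHEDULED " + canonical); // assumed log format
        scheduledCount++;
        enqueue(canonical);
    }

    /** Shared path for handing out work; the ordering is up to the subclass. */
    final String next() {
        String uri = dequeue();
        if (uri != null) recoveryLog.add("EMITTED " + uri);
        return uri;
    }

    /** Toy canonicalization (an assumption): trim and drop the fragment. */
    protected String canonicalize(String uri) {
        String trimmed = uri.trim();
        int hash = trimmed.indexOf('#');
        return hash >= 0 ? trimmed.substring(0, hash) : trimmed;
    }

    long scheduledCount() { return scheduledCount; }

    List<String> recoveryLog() { return Collections.unmodifiableList(recoveryLog); }

    // The only parts a new crawling strategy has to supply.
    protected abstract void enqueue(String canonicalUri);

    protected abstract String dequeue();
}

/** A trivial strategy: first in, first out. */
class FifoSketchFrontier extends SketchFrontier {
    private final Deque<String> queue = new ArrayDeque<>();

    @Override
    protected void enqueue(String canonicalUri) { queue.addLast(canonicalUri); }

    @Override
    protected String dequeue() { return queue.pollFirst(); }
}
```

A new strategy in this sketch only overrides enqueue and dequeue. If the
logging or canonicalization in the base class changes, every subclass
picks up the change without modification, which is the maintenance
benefit the text describes.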