Download Universiteit Leiden Computer Science

Chapter 3. Website for UML Database Collection
3.1 Improvement to Distributed Web Crawler
ImageCrawler has one limitation: the number of images it can collect is capped by what Google image
search returns. Because ImageCrawler runs on a single PC, crawling the web from scratch would be
inefficient, which is why the results of Google image search serve as the starting nodes. Lifting
this limit requires better hardware.
The ImageCrawler program runs on a single PC, which cannot process the large volume of information
available on the Internet. A high-performance web crawler can be built as a distributed
system.[21][22] Most search engines operate several IDCs (Internet Data Centers). Each IDC is in
charge of a range of IP addresses, and within each IDC many web crawler instances work in
parallel.
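The idea of each IDC being responsible for a range of IP addresses can be sketched as a simple lookup. This is a minimal illustration only: the data center names and the address ranges in `IDC_RANGES` are hypothetical and do not come from any real search engine.

```python
import ipaddress

# Hypothetical assignment of IPv4 ranges to data centers (illustrative values only).
IDC_RANGES = {
    "idc-a": ipaddress.ip_network("0.0.0.0/1"),    # addresses 0.x.x.x to 127.x.x.x
    "idc-b": ipaddress.ip_network("128.0.0.0/1"),  # addresses 128.x.x.x to 255.x.x.x
}


def idc_for_ip(ip: str) -> str:
    """Return the name of the data center whose range contains the given IP."""
    addr = ipaddress.ip_address(ip)
    for name, network in IDC_RANGES.items():
        if addr in network:
            return name
    raise ValueError(f"no data center is responsible for {ip}")
```

Within one data center, the hosts assigned this way could then be divided among the crawler instances that run in parallel.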
A distributed web crawler can be organized in two ways.
1. All web crawlers of the distributed system share the same connection to the Internet. They
cooperate to fetch data and combine the results within a local area network. The advantage of this
method is that the hardware resources are easy to expand. The bottleneck is the bandwidth of the
Internet connection, which is shared by all web crawler instances.
2. The web crawlers of the distributed system are geographically far apart. The advantage of this
method is that bandwidth is not a bottleneck. However, network conditions differ from place to
place, so integrating the results of the different web crawlers becomes the most important
problem.
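One small part of integrating results from several crawlers is removing URLs that more than one crawler has found. The following is a minimal sketch under the assumption that each crawler reports a plain list of image URLs; the function name `merge_results` is hypothetical, and issues such as delayed or partial reports are not handled.

```python
def merge_results(*batches):
    """Merge URL lists from several crawlers, keeping the first occurrence of each URL."""
    seen = set()
    merged = []
    for batch in batches:
        for url in batch:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged
```

For example, `merge_results(list_from_crawler_1, list_from_crawler_2)` yields one list in which every URL appears once, in the order it was first reported.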
3.2 Website for Database Collection
As the hardware for building a distributed system is not available at the moment, a website can be
established to collect databases from users who have downloaded the ImageCrawler program.
The website provides the download link for the program. Once a user has built a database, they can
upload it to the website. After a database has been uploaded, the website reads its contents and
writes them into a database stored on the server, which contains all the information from the
uploaded databases.
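The upload step could look roughly like the sketch below. The text does not specify the database format, so this assumes SQLite and a hypothetical table `images(url, label)`, where `label` records whether the URL is on the whitelist or the blacklist. Rows whose URL is already on the server are skipped.

```python
import sqlite3

# Hypothetical schema; the actual ImageCrawler database layout is not given in the text.
SCHEMA = "CREATE TABLE IF NOT EXISTS images (url TEXT PRIMARY KEY, label TEXT NOT NULL)"


def merge_upload(server: sqlite3.Connection, upload: sqlite3.Connection) -> int:
    """Copy rows from an uploaded database into the server database.

    Returns the number of rows that were new to the server.
    """
    server.execute(SCHEMA)
    before = server.execute("SELECT COUNT(*) FROM images").fetchone()[0]
    rows = upload.execute("SELECT url, label FROM images").fetchall()
    # INSERT OR IGNORE skips URLs that already exist on the server.
    server.executemany(
        "INSERT OR IGNORE INTO images (url, label) VALUES (?, ?)", rows
    )
    server.commit()
    after = server.execute("SELECT COUNT(*) FROM images").fetchone()[0]
    return after - before
```

In this sketch the first label stored for a URL wins; resolving conflicting labels from different users would need a separate policy.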
This is an alternative to a distributed web crawler. If more and more people use the program to
collect images and upload the image URLs, together with the “whitelist”/“blacklist” showing which
URLs contain images that are (or are not) UML class diagrams and the domain statistics derived
from the “whitelist”/“blacklist”, the server will hold a database that indexes a large number of
images that are UML class diagrams. It resembles the image index built by a search engine, except
that the people using the software take the place of web crawlers.
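The domain statistics mentioned above can be illustrated by counting how many URLs in a list come from each domain. This is a minimal sketch; the function name `domain_stats` is hypothetical, and it assumes the whitelist or blacklist is available as a list of absolute URLs.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_stats(urls):
    """Count how many URLs in the list belong to each domain."""
    return Counter(urlparse(url).netloc.lower() for url in urls)
```

Applied separately to the whitelist and the blacklist, such counts indicate which domains tend to host UML class diagrams and which do not.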