Universiteit Leiden Computer Science
Chapter 3. Website for UML Database Collection

3.1 Improvement to Distributed Web Crawler

ImageCrawler has one main limitation: the number of images it can collect is capped by what Google image search returns. The remedy is better hardware. ImageCrawler is implemented on a single PC, which cannot process the large volume of information available on the Internet, so crawling from scratch would not be efficient. That is why we use the results of Google image search as the starting nodes.

A high-performance web crawler can be achieved with a distributed system.[21][22] Most search engines have several IDCs (Internet Data Centers). Each IDC is in charge of a range of IP addresses, and within each IDC many web crawler instances work in parallel.

A distributed web crawler can be organized in two ways.

1. All web crawlers in the distributed system share the same connection to the Internet. They cooperate to fetch data and combine the results within a local area network. The advantage of this method is that the hardware resources are easy to expand. The bottleneck is the bandwidth of the Internet connection, which is shared by all crawler instances.

2. The web crawlers of the distributed system are geographically far apart. The advantage of this method is that bandwidth is not a bottleneck. However, network conditions differ from place to place, so integrating the results of the different web crawlers becomes the most important problem.

3.2 Website for Database Collection

As the hardware for building a distributed system is not available at the moment, a website can instead be established to collect databases from users who have downloaded the ImageCrawler program. The website provides the download link for the program. When a user has built a database, they can upload it to the website.
After a database has been uploaded, the website reads its contents and writes them into the database stored on the server, which holds all the information from the uploaded databases. This is an alternative to a distributed system of web crawlers. If more and more people use the program to collect images and upload the image URLs, together with the “whitelist”/“blacklist” showing which URLs contain images that are (or are not) UML class diagrams and the domain statistics derived from the “whitelist”/“blacklist”, the server will hold a database that indexes a large number of images of UML class diagrams. It is like the image index built by a search engine, except that the people using the software take the place of the web crawlers.
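The merge step described above can be illustrated with a minimal sketch. The text does not specify the database format that ImageCrawler produces, so everything here is an assumption made for illustration: SQLite as the storage engine, a hypothetical `images` table with `url` and `label` columns, and the helper names `open_database`, `merge_upload` and `domain_statistics`. The sketch also assumes that a URL already present on the server keeps its existing label, which is one possible policy and not necessarily the one the website uses.

```python
import sqlite3
from collections import Counter
from urllib.parse import urlparse

# Hypothetical schema: one row per image URL with its whitelist/blacklist label.
SCHEMA = """
CREATE TABLE IF NOT EXISTS images (
    url   TEXT PRIMARY KEY,
    label TEXT NOT NULL CHECK (label IN ('whitelist', 'blacklist'))
)
"""


def open_database(path=":memory:"):
    """Open (or create) a database that follows the assumed schema."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def count_rows(conn):
    return conn.execute("SELECT COUNT(*) FROM images").fetchone()[0]


def merge_upload(server, uploaded):
    """Copy the rows of an uploaded database into the server database.

    URLs that the server already knows are skipped (assumed policy).
    Returns the number of rows that were newly added.
    """
    rows = uploaded.execute("SELECT url, label FROM images").fetchall()
    before = count_rows(server)
    with server:  # commit on success, roll back on error
        server.executemany(
            "INSERT OR IGNORE INTO images (url, label) VALUES (?, ?)", rows
        )
    return count_rows(server) - before


def domain_statistics(conn):
    """Count whitelist/blacklist entries per domain."""
    stats = {}
    for url, label in conn.execute("SELECT url, label FROM images"):
        domain = urlparse(url).netloc
        stats.setdefault(domain, Counter())[label] += 1
    return stats


# Two users upload databases that partly overlap (made-up example URLs).
server = open_database()

user_a = open_database()
user_a.executemany("INSERT INTO images VALUES (?, ?)", [
    ("http://example.org/a.png", "whitelist"),
    ("http://example.org/b.png", "blacklist"),
])

user_b = open_database()
user_b.executemany("INSERT INTO images VALUES (?, ?)", [
    ("http://example.org/a.png", "whitelist"),  # already uploaded by user A
    ("http://example.com/c.png", "whitelist"),
])

added_a = merge_upload(server, user_a)
added_b = merge_upload(server, user_b)
stats = domain_statistics(server)
```

In this sketch the second upload adds only the one URL the server has not seen before, so the server index grows with each contribution without storing duplicates. Recomputing the domain statistics from the merged table, rather than summing the statistics each user uploads, avoids counting the same URL twice.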