[SoC 2007] the Web Crawler Library

Hi, I am a sophomore undergraduate student from China, majoring in Software Engineering. I have already submitted an application; it is about implementing a Web Crawler Library. There is a famous C++ web crawler called Larbin, which is very efficient and can fetch 5,000,000 pages a day on a standard PC. However, Larbin has several disadvantages. For example, it only runs under Linux and other BSD platforms; it does not support Windows. Furthermore, Larbin's extensibility is not very good, so it cannot be used in some projects. Most importantly, Larbin has not been developed since July 9th, 2003. Hence, I want to design a new Web Crawler Library. My proposals are as follows:

1. Simple
Simplicity is very important in a library; nobody wants trouble. It should stay easy to use, just like the other Boost libraries.

2. Highly configurable
Users can restrict the crawl to appointed websites, and set the depth, the file formats, and the number of processes and threads. The library only takes care of the "crawler" part; parsing and storage are handed over to the user. (A rough interface sketch is attached below as a postscript.)

3. Support for all operating platforms
This is the most important thing for a standard library. (Larbin only runs under Linux and BSD.)

4. Multi-threading and multi-processing
For higher speed, multi-threading and multi-processing are required.

5. Analysis of website updates
Downloading only new or updated files saves time and website traffic. (See the conditional-request sketch in the postscript.)

6. Compliance with the robots.txt rules

If possible, I also want to add a server/client mode that allows a large number of crawler programs to run together.

Even if my proposal is rejected, I still want to implement this project; I will continue my "Summer of Code" journey :)

It is very late in China, about 1:30 am, so I'd better go to bed. Good night.

Yang Lang
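P.S. To make point 2 concrete, here is a rough sketch of what the interface might look like. Every name in it (crawler::options, crawler::crawl, and so on) is hypothetical; nothing here is implemented yet, it only illustrates the configuration points above.

    // Hypothetical interface sketch -- none of these names exist yet.
    #include <string>
    #include <vector>

    namespace crawler {

    // Per-crawl settings: site restriction, depth, formats, concurrency.
    struct options {
        std::vector<std::string> allowed_hosts; // crawl only these sites
        int max_depth;                          // how deep to follow links
        std::vector<std::string> file_formats;  // e.g. "html", "pdf"
        int num_threads;                        // worker threads
        int num_processes;                      // worker processes
        bool obey_robots_txt;                   // honor robots.txt rules

        options() : max_depth(3), num_threads(4),
                    num_processes(1), obey_robots_txt(true) {}
    };

    // A fetched page. The library only crawls; parsing and storage
    // are handed to the user through a callback.
    struct page {
        std::string url;
        std::string body;
    };

    typedef void (*page_handler)(const page&);

    // Crawl from seed_url under the given options, invoking on_page
    // for every document fetched. (Signature is illustrative only.)
    void crawl(const std::string& seed_url,
               const options& opts,
               page_handler on_page);

    } // namespace crawler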
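And for point 5, one possible mechanism (my assumption, not a settled design) is HTTP conditional requests: the crawler remembers each page's Last-Modified and ETag headers and sends them back on the next visit, so the server can answer 304 Not Modified instead of resending an unchanged body. The cache type below is hypothetical.

    // Sketch: add conditional headers when we have seen the page before.
    #include <map>
    #include <string>

    struct cache_entry {
        std::string last_modified; // Last-Modified header from last fetch
        std::string etag;          // ETag header from last fetch
    };

    // Build the request for host+path; if the page is in the cache,
    // add If-Modified-Since / If-None-Match so the server may reply 304.
    std::string build_request(const std::string& host,
                              const std::string& path,
                              const std::map<std::string, cache_entry>& cache)
    {
        std::string req = "GET " + path + " HTTP/1.1\r\n"
                          "Host: " + host + "\r\n";
        std::map<std::string, cache_entry>::const_iterator it =
            cache.find(host + path);
        if (it != cache.end()) {
            if (!it->second.last_modified.empty())
                req += "If-Modified-Since: " + it->second.last_modified + "\r\n";
            if (!it->second.etag.empty())
                req += "If-None-Match: " + it->second.etag + "\r\n";
        }
        req += "Connection: close\r\n\r\n";
        return req;
    }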