[SoC 2007] the Web Crawler Library

Hi, I am a sophomore undergraduate student from China, majoring in Software Engineering. I have already submitted an application; it is about implementing a Web Crawler Library. There is a famous C++ web crawler called Larbin, which is very efficient and can fetch 5,000,000 pages a day on a standard PC. However, Larbin has several disadvantages. For example, it only runs under Linux and other BSD platforms; it does not support Windows. Furthermore, Larbin's extensibility is not very good, so it cannot be used in some projects. Most importantly, Larbin has not been developed since July 9th, 2003. Hence, I want to design a new Web Crawler Library. My proposals are as follows:

1. Simple
Simplicity is very important in a library; nobody wants trouble. It should stay easy to use, just like the other Boost libraries.

2. Highly configurable
Users can restrict the crawl to appointed websites, and set the depth, the file formats, and the number of processes and threads. The library only takes care of the "crawler" part; parsing and storage are handed over to the user. (A rough interface sketch is attached below as a postscript.)

3. Support for all operating platforms
This is the most important thing for a standard library. (Larbin only runs under Linux and BSD.)

4. Multi-threading and multi-processing
For higher speed, multi-threading and multi-processing are required.

5. Analysis of website updates
Downloading only new or updated files saves time and website traffic. (See the conditional-request sketch in the postscript.)

6. Compliance with the robots.txt rules

If possible, I also want to add a server/client mode that allows a large number of crawler programs to run together.

Even if my proposal is rejected, I still want to implement this project; I will continue my "Summer of Code" journey :)

It is very late in China, about 1:30 am, so I'd better go to bed. Good night.

Yang Lang
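P.S. To make point 2 concrete, here is a rough sketch of what the interface might look like. Every name in it (crawler::options, crawler::crawl, and so on) is hypothetical; nothing here is implemented yet, it only illustrates the configuration points above.

    // Hypothetical interface sketch -- none of these names exist yet.
    #include <string>
    #include <vector>

    namespace crawler {

    // Per-crawl settings: site restriction, depth, formats, concurrency.
    struct options {
        std::vector<std::string> allowed_hosts; // crawl only these sites
        int max_depth;                          // how deep to follow links
        std::vector<std::string> file_formats;  // e.g. "html", "pdf"
        int num_threads;                        // worker threads
        int num_processes;                      // worker processes
        bool obey_robots_txt;                   // honor robots.txt rules

        options() : max_depth(3), num_threads(4),
                    num_processes(1), obey_robots_txt(true) {}
    };

    // A fetched page. The library only crawls; parsing and storage
    // are handed to the user through a callback.
    struct page {
        std::string url;
        std::string body;
    };

    typedef void (*page_handler)(const page&);

    // Crawl from seed_url under the given options, invoking on_page
    // for every document fetched. (Signature is illustrative only.)
    void crawl(const std::string& seed_url,
               const options& opts,
               page_handler on_page);

    } // namespace crawler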
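And for point 5, one possible mechanism (my assumption, not a settled design) is HTTP conditional requests: the crawler remembers each page's Last-Modified and ETag headers and sends them back on the next visit, so the server can answer 304 Not Modified instead of resending an unchanged body. The cache type below is hypothetical.

    // Sketch: add conditional headers when we have seen the page before.
    #include <map>
    #include <string>

    struct cache_entry {
        std::string last_modified; // Last-Modified header from last fetch
        std::string etag;          // ETag header from last fetch
    };

    // Build the request for host+path; if the page is in the cache,
    // add If-Modified-Since / If-None-Match so the server may reply 304.
    std::string build_request(const std::string& host,
                              const std::string& path,
                              const std::map<std::string, cache_entry>& cache)
    {
        std::string req = "GET " + path + " HTTP/1.1\r\n"
                          "Host: " + host + "\r\n";
        std::map<std::string, cache_entry>::const_iterator it =
            cache.find(host + path);
        if (it != cache.end()) {
            if (!it->second.last_modified.empty())
                req += "If-Modified-Since: " + it->second.last_modified + "\r\n";
            if (!it->second.etag.empty())
                req += "If-None-Match: " + it->second.etag + "\r\n";
        }
        req += "Connection: close\r\n\r\n";
        return req;
    }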