infinite crawl #12

Open
martingg88 opened this issue May 15, 2015 · 4 comments

Comments

@martingg88

Will this cause an infinite crawl on bigger sites? What strategy can be used to crawl a website efficiently?

@jculvey
Owner

jculvey commented May 15, 2015

A site will have a finite number of pages. The crawler avoids cycles by keeping a set of previously visited URLs. For example, page A links to B. B is crawled, and it links back to A. A won't be recrawled, since its URL is already in the set of seen URLs.
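
To make the A/B example concrete, here is a minimal sketch of cycle avoidance with a seen set. It is not roboto's actual code: the `LinkGraph` type, the `crawl` function, and the in-memory `site` object are hypothetical stand-ins for real page fetching.

```typescript
// Hypothetical in-memory "site": each URL maps to the URLs it links to.
type LinkGraph = Record<string, string[]>;

// Breadth-first crawl that never queues the same URL twice.
function crawl(start: string, links: LinkGraph): string[] {
  const seen = new Set<string>([start]); // URLs already queued or crawled
  const queue: string[] = [start];       // URLs still waiting to be crawled
  const order: string[] = [];            // order in which pages were crawled

  while (queue.length > 0) {
    const url = queue.shift() as string;
    order.push(url);
    for (const next of links[url] ?? []) {
      if (!seen.has(next)) { // already-seen URLs are skipped, which breaks cycles
        seen.add(next);
        queue.push(next);
      }
    }
  }
  return order;
}

// A links to B, and B links back to A (plus a third page, C).
const site: LinkGraph = { A: ["B"], B: ["A", "C"], C: [] };
const crawled = crawl("A", site);
console.log(crawled);
```

Because every URL enters the set at most once, the loop runs at most once per distinct URL, so the crawl ends as long as the number of distinct URLs is finite.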

In addition, all URLs are normalized before they are crawled and stored in the visited URLs set. This helps avoid duplicate page crawls. Here's an example:

http://foo.com/people?age=30&filter=joe&sort=up
https://foo.com/people?age=30&sort=up&filter=joe

In this case, the URLs differ, but in most cases they will produce the same response. You can read more about roboto's normalization routine here: https://github.com/jculvey/roboto#url-normalization
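
As a rough illustration of how the two example URLs could collapse to one key, here is a sketch using the standard `URL` class. The rules are assumptions made for this example (ignore the scheme, sort query parameters, drop the fragment) and may not match roboto's real routine, which is described at the link above. `normalizeUrl` is a hypothetical helper name.

```typescript
// Assumed rules for this sketch only: ignore http vs https, sort the query
// parameters by name, and drop any #fragment.
function normalizeUrl(raw: string): string {
  const url = new URL(raw);
  url.searchParams.sort(); // order-insensitive comparison of query strings
  const query = url.searchParams.toString();
  // url.host is already lower-cased by the URL parser for http(s) URLs.
  return url.host + url.pathname + (query ? "?" + query : "");
}

const first = normalizeUrl("http://foo.com/people?age=30&filter=joe&sort=up");
const second = normalizeUrl("https://foo.com/people?age=30&sort=up&filter=joe");
console.log(first === second);
```

Storing the normalized string in the seen set, rather than the raw URL, is what lets the crawler treat both links as the same page.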

@martingg88
Author

One last question: does it support a stop and resume feature?

@jculvey
Owner

jculvey commented May 16, 2015

Nope, not yet. Sorry :/

It's one of the things people have asked for. I'll look into adding it soon.

Would having something like redis or sqlite as a dependency be an issue for you?
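
For context, stop and resume mostly comes down to persisting two things: the set of seen URLs and the queue of URLs still to crawl. The sketch below is not a roboto feature; `CrawlState`, `saveState`, and `loadState` are hypothetical names, and the resulting JSON string could be written to a file, redis, or sqlite.

```typescript
// Hypothetical crawl state: what has been seen and what is still queued.
interface CrawlState {
  seen: Set<string>;
  queue: string[];
}

// Serialize the state; a Set is not JSON-serializable, so convert it to an array.
function saveState(state: CrawlState): string {
  return JSON.stringify({ seen: Array.from(state.seen), queue: state.queue });
}

// Rebuild the state from a previously saved string.
function loadState(json: string): CrawlState {
  const data = JSON.parse(json) as { seen: string[]; queue: string[] };
  return { seen: new Set(data.seen), queue: data.queue };
}

// Pause with A crawled and B queued, then restore the same state.
const paused: CrawlState = { seen: new Set(["A", "B"]), queue: ["B"] };
const resumed = loadState(saveState(paused));
console.log(resumed.queue, resumed.seen.size);
```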

@martingg88
Author

Great, thanks. How about a Waterline adapter, so that developers can choose any database available in the Node.js ecosystem?

Here is the reference for Waterline:

https://github.com/balderdashy/waterline
