Crawl any page, parse only whitelisted and not blacklisted pages #9

ArsalanDotMe opened this issue Dec 29, 2014 · 2 comments

@ArsalanDotMe (Contributor)

On many websites, you are only interested in one kind of page, such as a product details page. But what if there are no links to other product detail pages from there? The crawler would get stuck.
I think it could work like this: the crawler crawls pages freely (respecting the robots.txt file, of course) but only parses the qualifying pages to extract items.

@jculvey (Owner) commented Dec 29, 2014

Here's how I'd like to implement this:

The crawler will get two new functions, discardItem and discardLinks. If discardItem returns true, then no item pipelines will be invoked. If discardLinks returns true, none of the extracted links will be added to the urlFrontier.

There are times when you want to crawl a page for its links, but you don't want to keep the response as an item. In the e-commerce example you mentioned, this might be a category page with a filtered view of products. In that case, you wouldn't blacklist the category URL pattern, but you would provide the discardItem function, perhaps like so:

crawler.discardItem = function(res) {
  // Discard items from filtered category views (anything matching "cat=");
  // links found on these pages will still be followed.
  if (res.match(/cat=/)) {
    return true;
  }
  return false;
};

There are other times when you want to crawl a page for an item but don't want to follow any of its links. Some crawler frameworks have a maxDepth param to try to solve this, but that doesn't work as well, since you often want to control the crawl depth based on the page content.
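
For that case, a discardLinks sketch might look something like the following. This assumes it receives the same argument as discardItem, which isn't settled yet, and the product URL pattern is just a placeholder:

crawler.discardLinks = function(res) {
  // Sketch only: keep scraping product detail pages as items, but don't
  // follow any links found on them. Returning true keeps the extracted
  // links out of the urlFrontier.
  if (res.match(/\/product\//)) {
    return true;
  }
  return false;
};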

The whitelist and blacklist would still be useful for other purposes. You would still want to blacklist things like auth pages, which you don't want to even attempt to crawl, as well as pages with very slow or large responses that would drag down your crawl rate.
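
For illustration only, a blacklist could look something like this; the exact shape here is hypothetical, not the crawler's actual option format:

// Hypothetical shape -- pages matching these patterns are never fetched at
// all, unlike discarded items/links, which are still downloaded first.
crawler.blacklist = [
  /\/login/,            // auth pages: no point even requesting them
  /\/reports\/export/   // example of a slow or large response that hurts crawl rate
];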

How does this sound?

@ArsalanDotMe (Contributor, Author)

Sounds better and it's more flexible too.

