Crawl any page, parse only whitelisted and not blacklisted pages #9

ArsalanDotMe opened this issue Dec 29, 2014 · 2 comments

@ArsalanDotMe (Contributor)

On many websites, you are only interested in one kind of page, such as a product details page. But what if there are no links to other product detail pages from there? The crawler would get stuck.
I think it could work like this: the crawler crawls pages freely (respecting the robots.txt file, of course) but only parses the qualifying pages to extract items.

@jculvey (Owner) commented Dec 29, 2014

Here's how I'd like to implement this:

The crawler will get two new functions, discardItem and discardLinks. If discardItem returns true, then no item pipelines will be invoked. If discardLinks returns true, none of the extracted links will be added to the urlFrontier.

There are times when you want to crawl a page for its links, but you don't want to keep the response as an item. In the e-commerce example you mentioned, this might be a category page with a filtered view of products. In that case, you wouldn't blacklist the category URL pattern, but you would provide the discardItem function, perhaps like so:

crawler.discardItem = function(res) {
  // Discard items from filtered category views (anything matching "cat=");
  // links found on these pages will still be followed.
  if (res.match(/cat=/)) {
    return true;
  }
  return false;
};

There are other times when you want to crawl a page for an item but don't want to follow any of its links. Some crawler frameworks have a maxDepth param to try to solve this, but that doesn't work as well, since you often want to control the crawl depth based on the page content.
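
For that case, a discardLinks sketch might look something like the following. This assumes it receives the same argument as discardItem, which isn't settled yet, and the product URL pattern is just a placeholder:

crawler.discardLinks = function(res) {
  // Sketch only: keep scraping product detail pages as items, but don't
  // follow any links found on them. Returning true keeps the extracted
  // links out of the urlFrontier.
  if (res.match(/\/product\//)) {
    return true;
  }
  return false;
};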

The whitelist and blacklist would still be useful for other purposes. You would still want to blacklist things like auth pages, which you don't want to even attempt to crawl, as well as pages with very slow or large responses that would drag down your crawl rate.
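
For illustration only, a blacklist could look something like this; the exact shape here is hypothetical, not the crawler's actual option format:

// Hypothetical shape -- pages matching these patterns are never fetched at
// all, unlike discarded items/links, which are still downloaded first.
crawler.blacklist = [
  /\/login/,            // auth pages: no point even requesting them
  /\/reports\/export/   // example of a slow or large response that hurts crawl rate
];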

How does this sound?

@ArsalanDotMe (Contributor, Author)

Sounds better and it's more flexible too.

