
Stop/resume #2

Open
f1ames opened this issue Nov 29, 2014 · 2 comments
f1ames (Contributor) commented Nov 29, 2014

I think I saw it in the roadmap.
It would be nice if you could stop and then resume roboto so it doesn't start over from the beginning/startUrls. I think it could be achieved via de/serialization, so when you stop and start again it loads its previous state.

jculvey (Owner) commented Nov 29, 2014

Yeah, this is really lacking right now.

I've been a little torn over how to implement this. In the long term I think it would be cool if there was some sort of admin UI where you could view previous crawl results, start and stop new crawls, and maybe even do a little configuration.

That might be a little heavyweight for some people, so having a simple pause/resume from the command line would be nice.

How does this change sound?

In the crawler you can configure a queue file:

var roboto = require('roboto');

var crawler = new roboto.Crawler({
  startUrls: [
    "https://news.ycombinator.com/",
  ],
  queueFile: '/var/foo'  // crawl state gets flushed here periodically
});

Then, the url frontier and set of seen urls will periodically be serialized and flushed out to the file as json.
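Roughly, the flush could look something like the sketch below. The internal names here (_urlFrontier, _seenUrls, flushQueueFile) are just placeholders for whatever the crawler keeps in memory, not roboto's current API:

var fs = require('fs');

// Dump the in-memory crawl state to the configured queue file as json.
function flushQueueFile(crawler) {
  var state = {
    frontier: crawler._urlFrontier,        // urls still waiting to be fetched
    seen: Object.keys(crawler._seenUrls)   // urls already visited
  };
  fs.writeFileSync(crawler.queueFile, JSON.stringify(state));
}

// Flush every 30 seconds while the crawl is running.
setInterval(function() {
  flushQueueFile(crawler);
}, 30 * 1000);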

f1ames (Contributor, Author) commented Nov 30, 2014

Well, I had a very similar idea: you configure a queue file and the crawler periodically serializes the data necessary for resuming.
The flow I was thinking of (a rough code sketch follows the list):

  • if you don't define queueFile it works like current version
  • if you define queueFile it checks if it exists and if it's empty
    • if it's empty, crawler starts from the beginning
    • if it's not empty, crawler deserializes data and starts from this point

If the crawler is done, it removes the queueFile so the next run starts from the beginning.
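A rough sketch of that startup/shutdown flow (again, the state shape and internal names are assumptions, not how roboto actually stores things):

var fs = require('fs');

// On startup: resume from the queue file if it exists and is non-empty,
// otherwise fall back to startUrls as in the current version.
function loadOrStartFresh(crawler) {
  if (!crawler.queueFile || !fs.existsSync(crawler.queueFile)) {
    return;
  }
  var contents = fs.readFileSync(crawler.queueFile, 'utf8');
  if (contents.trim().length === 0) {
    return;
  }
  var state = JSON.parse(contents);
  crawler._urlFrontier = state.frontier;
  crawler._seenUrls = {};
  state.seen.forEach(function(url) {
    crawler._seenUrls[url] = true;
  });
}

// When the crawl finishes, remove the queue file so the next run starts over.
function onCrawlDone(crawler) {
  if (crawler.queueFile && fs.existsSync(crawler.queueFile)) {
    fs.unlinkSync(crawler.queueFile);
  }
}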

nsakovich pushed a commit to nsakovich/roboto that referenced this issue Dec 24, 2015
WEBCLI-824 Add caching support in Devcenter crawler