scraper directory uses the user-input output directory #95

Closed
rossmounce opened this issue Dec 2, 2016 · 6 comments
Comments

@rossmounce
Member

rossmounce commented Dec 2, 2016

I installed quickscrape as per the readme instructions.
I cloned the example journal scrapers repo.

I tried the first peerj-384 example in the readme, but it didn't work.

The problem is it appears to be looking for the scraper file inside of the specified output folder!
e.g. instead of looking in:
journal-scrapers/scrapers/peerj.json

it looks for the scraper file in:
peerj-384/journal-scrapers/scrapers/peerj.json

A quick workaround is to specify the output folder as . (the current directory).

$ quickscrape -V
0.4.7
$ node -v
v0.10.48
$ npm -v
2.15.1
$ quickscrape \
>   --url https://peerj.com/articles/384 \
>   --scraper journal-scrapers/scrapers/peerj.json \
>   --output peerj-384
info: quickscrape 0.4.7 launched with...
info: - URL: https://peerj.com/articles/384
info: - Scraper: /home/ross/Downloads/pica/peerj-384/journal-scrapers/scrapers/peerj.json
info: - Rate limit: 3 per minute
info: - Log level: info

fs.js:439
  return binding.open(pathModule._makeLong(path), stringToFlags(flags), mode);
                 ^
Error: ENOENT, no such file or directory '/home/ross/Downloads/pica/peerj-384/journal-scrapers/scrapers/peerj.json'
    at Object.fs.openSync (fs.js:439:18)
    at Object.fs.readFileSync (fs.js:290:15)
    at Object.<anonymous> (/home/ross/.nvm/v0.10.48/lib/node_modules/quickscrape/bin/quickscrape.js:138:23)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Function.Module.runMain (module.js:497:10)
    at startup (node.js:119:16)
    at node.js:945:3
ross@ross-x3:~/Downloads/pica$ quickscrape   --url https://peerj.com/articles/384   --scraper journal-scrapers/scrapers/peerj.json   --output .
info: quickscrape 0.4.7 launched with...
info: - URL: https://peerj.com/articles/384
info: - Scraper: /home/ross/Downloads/pica/journal-scrapers/scrapers/peerj.json
info: - Rate limit: 3 per minute
info: - Log level: info
info: urls to scrape: 1
info: processing URL: https://peerj.com/articles/384
info: [scraper]. URL rendered. https://peerj.com/articles/384.
info: [scraper]. download started. fulltext.xml.
info: [scraper]. download started. fulltext.xml.
info: [scraper]. download started. fulltext.html.
info: [scraper]. download started. fulltext.pdf.
info: [scraper]. download started. fig-1-full.png.
info: URL processed: captured 28/34 elements (6 captures failed)
info: all tasks completed

$ tree https_peerj.com_articles_384/
https_peerj.com_articles_384/
├── fig-1-full.png
├── fulltext.html
├── fulltext.pdf
├── fulltext.xml
└── results.json

0 directories, 5 files
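
The two "Scraper:" paths in the logs above are what you would get if the relative --scraper argument were resolved against the output folder instead of the directory the command was launched from. The sketch below only illustrates that mechanism. It is not taken from the quickscrape source, and the helper name resolveScraper is hypothetical.

```javascript
const path = require('path');

// Hypothetical helper (not quickscrape's API): resolve a relative
// scraper path against whichever directory is treated as "current".
// path.posix is used so the result is the same on any platform.
function resolveScraper(baseDir, scraperArg) {
  return path.posix.resolve(baseDir, scraperArg);
}

// Values copied from the report above.
const launchDir = '/home/ross/Downloads/pica';
const outputDir = path.posix.resolve(launchDir, 'peerj-384');
const scraperArg = 'journal-scrapers/scrapers/peerj.json';

// Assumed buggy order: the path is resolved after the output folder
// has become the base directory.
const buggy = resolveScraper(outputDir, scraperArg);

// Expected order: the path is resolved against the launch directory.
const fixed = resolveScraper(launchDir, scraperArg);

console.log(buggy);
// /home/ross/Downloads/pica/peerj-384/journal-scrapers/scrapers/peerj.json
console.log(fixed);
// /home/ross/Downloads/pica/journal-scrapers/scrapers/peerj.json
```

Under this assumption, passing . as the output folder makes the two base directories identical, which would explain why the workaround succeeds.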

@rossmounce
Member Author

Looks like this issue has been reported before, but not fixed yet in the main code: #56

@tarrow
Contributor

tarrow commented Dec 2, 2016

Yep; these are quite old bugs, but since fewer people seem to have been interested in quickscrape and work has been going on with updating thresher, the code to fix this hasn't made it to the master branch yet. Have a look at tarrow/master, where lots of these fixes are.

@rossmounce
Member Author

Ah, cool. I shall have a look at tarrow/master then, thanks :)

@rossmounce
Member Author

@tarrow erm... I just realised I don't know how to install this kind of code from source. tarrow/master is at https://github.com/tarrow/quickscrape, right?

npm install --global quickscrape

won't install your quickscrape will it?

How do I install your updated quickscrape?

@tarrow
Contributor

tarrow commented Dec 2, 2016

npm install --global tarrow/quickscrape should do it :)

@rossmounce
Member Author

cheers. I know literally nothing about npm 😭
