scraper directory uses the user-input output directory #95

rossmounce · 2016-12-02T10:19:29Z

I installed quickscrape as per the readme instructions.
I cloned the example journal-scrapers repo.

I tried the first peerj-384 example in the readme, but it didn't work.

The problem is it appears to be looking for the scraper file inside of the specified output folder!
e.g. instead of looking in:
journal-scrapers/scrapers/peerj.json

it looks for the scraper file in:
peerj-384/journal-scrapers/scrapers/peerj.json

A quick workaround is just to specify output folder as .
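
To illustrate the symptom, here is a minimal Node.js sketch of how a relative scraper path can end up under the output folder. It is not quickscrape's actual source: the function name `resolveScraper` and the `chdirFirst` flag are hypothetical, and the assumption that the tool switches into the output directory before resolving `--scraper` is only a guess from the logged path.

```javascript
// Hypothetical sketch; NOT quickscrape's real code.
// Assumption: the relative --scraper path is resolved after the process
// has already moved into the --output directory.
const path = require('path');

// Return the absolute scraper path, given the directory quickscrape was
// launched from and the raw --output / --scraper arguments.
// chdirFirst = true mimics the suspected buggy ordering.
function resolveScraper(launchDir, outputArg, scraperArg, chdirFirst) {
  // Working directory in effect when the scraper path gets resolved.
  const cwd = chdirFirst
    ? path.posix.resolve(launchDir, outputArg) // already inside the output dir
    : launchDir;                               // still in the launch dir
  return path.posix.resolve(cwd, scraperArg);
}

const launchDir = '/home/ross/Downloads/pica';
const scraperArg = 'journal-scrapers/scrapers/peerj.json';

// Suspected buggy ordering with --output peerj-384
const buggy = resolveScraper(launchDir, 'peerj-384', scraperArg, true);
// Same ordering with --output . (the workaround)
const workaround = resolveScraper(launchDir, '.', scraperArg, true);
// Resolving before any directory change
const fixed = resolveScraper(launchDir, 'peerj-384', scraperArg, false);

console.log(buggy);
console.log(workaround);
console.log(fixed);
```

Under that assumption, `--output .` works only because the output directory and the launch directory coincide; resolving the scraper path to an absolute path before any directory change would avoid the problem for any output folder.
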

$ quickscrape -V
0.4.7
$ node -v
v0.10.48
$ npm -v
2.15.1
$ quickscrape \
>   --url https://peerj.com/articles/384 \
>   --scraper journal-scrapers/scrapers/peerj.json \
>   --output peerj-384
info: quickscrape 0.4.7 launched with...
info: - URL: https://peerj.com/articles/384
info: - Scraper: /home/ross/Downloads/pica/peerj-384/journal-scrapers/scrapers/peerj.json
info: - Rate limit: 3 per minute
info: - Log level: info

fs.js:439
  return binding.open(pathModule._makeLong(path), stringToFlags(flags), mode);
                 ^
Error: ENOENT, no such file or directory '/home/ross/Downloads/pica/peerj-384/journal-scrapers/scrapers/peerj.json'
    at Object.fs.openSync (fs.js:439:18)
    at Object.fs.readFileSync (fs.js:290:15)
    at Object.<anonymous> (/home/ross/.nvm/v0.10.48/lib/node_modules/quickscrape/bin/quickscrape.js:138:23)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Function.Module.runMain (module.js:497:10)
    at startup (node.js:119:16)
    at node.js:945:3
$ quickscrape --url https://peerj.com/articles/384 --scraper journal-scrapers/scrapers/peerj.json --output .
info: quickscrape 0.4.7 launched with...
info: - URL: https://peerj.com/articles/384
info: - Scraper: /home/ross/Downloads/pica/journal-scrapers/scrapers/peerj.json
info: - Rate limit: 3 per minute
info: - Log level: info
info: urls to scrape: 1
info: processing URL: https://peerj.com/articles/384
info: [scraper]. URL rendered. https://peerj.com/articles/384.
info: [scraper]. download started. fulltext.xml.
info: [scraper]. download started. fulltext.xml.
info: [scraper]. download started. fulltext.html.
info: [scraper]. download started. fulltext.pdf.
info: [scraper]. download started. fig-1-full.png.
info: URL processed: captured 28/34 elements (6 captures failed)
info: all tasks completed

$ tree https_peerj.com_articles_384/
https_peerj.com_articles_384/
├── fig-1-full.png
├── fulltext.html
├── fulltext.pdf
├── fulltext.xml
└── results.json

0 directories, 5 files

rossmounce · 2016-12-02T10:23:20Z

Looks like this issue has been reported before, but not fixed yet in the main code: #56

tarrow · 2016-12-02T10:23:21Z

Yep; these are quite old bugs, but since fewer people seem to have been interested in quickscrape, and work has been going on with updating thresher, the code to fix this hasn't made it to the master branch yet. Have a look at tarrow/master, where lots of these fixes live.

rossmounce · 2016-12-02T10:25:20Z

ah cool. I shall have a look at tarrow/master then, thx :)

rossmounce · 2016-12-02T10:46:55Z

@tarrow erm... I just realised I don't know how to compile this kind of code from source. tarrow/master is at https://github.com/tarrow/quickscrape right?

npm install --global quickscrape

won't install your quickscrape will it?

How do I install your updated quickscrape?

tarrow · 2016-12-02T10:47:47Z

npm install --global tarrow/quickscrape should do it :)

rossmounce · 2016-12-02T10:49:41Z

cheers. I know literally nothing about npm 😭

rossmounce added the duplicate label Dec 2, 2016

rossmounce closed this as completed Dec 2, 2016
