
Hanging seemingly randomly when downloading a list of URLs #74

Open · robintw opened this issue Apr 29, 2016 · 7 comments

robintw commented Apr 29, 2016

I am trying to run quickscrape to download a large number of papers from PLOS One. I've got a list of URLs to download and have run quickscrape as:

quickscrape -r /mnt/cm-volume/content-mine/PLOS_DOIs_2015.txt -s /mnt/cm-volume/content-mine/journal-scrapers/scrapers/plos.json -o /mnt/cm-volume/content-mine/plos-2015-new2/ -l debug

This seems to work fine for a while, but then the process just hangs after downloading a fulltext.xml file. For example, the end of the output (with debug logging turned on) looks like this:

data: [scraper]. element captured. figures_image. article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g001.
data: [scraper]. element captured. figures_image. article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g002.
data: [scraper]. element captured. figures_image. article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g003.
data: [scraper]. element captured. figures_image. article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g004.
data: [scraper]. element captured. figures_image. article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g005.
data: [scraper]. element captured. figures_image. article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g006.
data: [scraper]. element captured. figures_image. article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g007.
debug: [scraper]. element results. figures_image. article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g001,article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g002,article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g003,article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g004,article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g005,article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g006,article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g007.
data: [scraper]. element capture failed. license.
debug: [scraper]. selector had no results. //span[contains(concat(' ', normalize-space(@class), ' '), ' license-p ')]. license.
debug: [scraper]. element results. license. .
data: [scraper]. element capture failed. copyright.
debug: [scraper]. selector had no results. //span[starts-with(@itemprop, 'copyright')]/... copyright.
debug: [scraper]. element results. copyright. .
info: [scraper]. download started. fulltext.xml.

I can't see any errors here, and if I run that particular URL (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0114250) by itself, it works fine and downloads both the fulltext.xml and the fulltext.pdf.

Does anyone have any idea what might be going on here? It is making it really hard to get a large corpus of articles to mine.

blahah (Member) commented Apr 29, 2016

If you're downloading from PLOS you should really use getpapers instead, or download the bulk archive. Scraping is a last resort: it's more server-intensive, less reliable, and much slower!

But ignoring that, I'm not sure exactly what's happening from the output you've given. Is it always particular URLs that it hangs on, or is it seemingly random?

robintw (Author) commented Apr 29, 2016

It is seemingly random, and all of the URLs it hangs on work fine if I run them individually.

I wasn't aware that getpapers could grab large volumes of PLOS papers - as far as I could tell from the documentation, it only searches sources like EuropePMC, and I'm interested in getting non-biomedical papers from PLOS too (basically I'm trying to get all PLOS papers from 2015). Is there a way of doing this with getpapers?

Also, I hadn't heard of the PLOS bulk archive, and can't seem to find much about it on Google. Do you know where I could download a bulk archive from?

blahah (Member) commented Apr 29, 2016

EuropePMC is not only for biomedical articles (the name is misleading). All of PLOS is there: http://europepmc.org/search?query=%28PUBLISHER:%22Public+Library+of+Science%22%29&page=1.

To get all PLOS papers from 2015 you would do:

--query '(PUBLISHER:"Public Library of Science") AND (FIRST_PDATE:[2015-01-01 TO 2015-12-31])'
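
A complete invocation might then look something like this (the output directory name is just an example, and -x/-p should request the fulltext XML and PDFs where available - check the getpapers docs for the exact flags):

getpapers --query '(PUBLISHER:"Public Library of Science") AND (FIRST_PDATE:[2015-01-01 TO 2015-12-31])' --outdir plos-2015 -x -p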

robintw (Author) commented Apr 29, 2016

Ah that's great - thank you. I have that running on my server now :-)

It'd still be good to work out what is going on with quickscrape at some point, as most of the journals I'm trying to scrape aren't available as easily as PLOS. I just have no idea where to start with the debugging; maybe I need to drop print statements everywhere in the code and see where it hangs.
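
One possible starting point, assuming a Linux host and without touching the code: attach strace to the hung process and look at which system call it is blocked in (a read on a network socket, say, would point at a request that never completes):

# attach to the hung quickscrape process (assumes only one is running)
strace -f -p "$(pgrep -f quickscrape)"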

robintw (Author) commented Apr 29, 2016

Also, I'm struggling to download very large numbers of papers with getpapers (I've commented on an issue and think I may have found a workaround), so I'm intrigued: what was the bulk archive you mentioned?

blahah (Member) commented Apr 29, 2016

I don't have time to debug today, I'm afraid. If you go to the PubMed FTP, you can find a bunch of archives called A-C...tar.gz and so on. The one whose range covers P will contain all the PLOS papers, one archive per journal.

petermr (Member) commented May 2, 2016

The following URL hangs quickscrape:

http://www.tandfonline.com/doi/full/10.13039/501100005071

It's an unresolvable URL ("The requested article is not currently available on this site."), but quickscrape should time out and move on instead of hanging.
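
In the meantime, a shell-level workaround might help (a sketch, assuming GNU coreutils timeout, one URL per line in the list file, and quickscrape's -u flag for a single URL; the file names are placeholders): run quickscrape once per URL and kill any run that exceeds a limit, so a hanging URL only costs that one article.

# give each URL five minutes, then kill it and record the failure
while read -r url; do
  timeout 300 quickscrape -u "$url" -s plos.json -o out/ || echo "$url" >> failed.txt
done < urls.txt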
