
Hanging seemingly randomly when downloading a list of URLs #74

Open · robintw opened this issue Apr 29, 2016 · 7 comments

robintw commented Apr 29, 2016

I am trying to run quickscrape to download a large number of papers from PLOS One. I've got a list of URLs to download and have run quickscrape as:

quickscrape -r /mnt/cm-volume/content-mine/PLOS_DOIs_2015.txt -s /mnt/cm-volume/content-mine/journal-scrapers/scrapers/plos.json -o /mnt/cm-volume/content-mine/plos-2015-new2/ -l debug

This seems to work fine for a while, but then the process just hangs after downloading a fulltext.xml file. For example, the end of the output (with debug logging turned on) looks like this:

data: [scraper]. element captured. figures_image. article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g001.
data: [scraper]. element captured. figures_image. article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g002.
data: [scraper]. element captured. figures_image. article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g003.
data: [scraper]. element captured. figures_image. article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g004.
data: [scraper]. element captured. figures_image. article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g005.
data: [scraper]. element captured. figures_image. article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g006.
data: [scraper]. element captured. figures_image. article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g007.
debug: [scraper]. element results. figures_image. article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g001,article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g002,article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g003,article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g004,article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g005,article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g006,article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g007.
data: [scraper]. element capture failed. license.
debug: [scraper]. selector had no results. //span[contains(concat(' ', normalize-space(@class), ' '), ' license-p ')]. license.
debug: [scraper]. element results. license. .
data: [scraper]. element capture failed. copyright.
debug: [scraper]. selector had no results. //span[starts-with(@itemprop, 'copyright')]/... copyright.
debug: [scraper]. element results. copyright. .
info: [scraper]. download started. fulltext.xml.

I can't see any errors here, and if I run that particular URL (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0114250) by itself, it works fine and downloads both the fulltext.xml and the fulltext.pdf.

Does anyone have any idea what might be going on here? It is making it really hard to get a large corpus of articles to mine.

blahah (Member) commented Apr 29, 2016

If you're downloading from PLOS you should really use getpapers instead, or download the bulk archive. Scraping is a last resort: it's more server-intensive, less reliable, and much slower!

But ignoring that, I'm not sure exactly what's happening from the output you've given. Is it always particular URLs that it hangs on, or is it seemingly random?

robintw (Author) commented Apr 29, 2016

It is seemingly random, and all of the URLs it hangs on work fine if I run them individually.

I wasn't aware that getpapers could grab large volumes of PLOS papers - as far as I could tell from the documentation, it only searches sources like EuropePMC, and I'm interested in getting non-biomedical papers from PLOS too (basically I'm trying to get all PLOS papers from 2015). Is there a way of doing this with getpapers?

Also, I hadn't heard of the PLOS bulk archive, and can't seem to find much about it on Google. Do you know where I could download a bulk archive from?

blahah (Member) commented Apr 29, 2016

EuropePMC is not only for biomedical articles (the name is misleading). All of PLOS is there: http://europepmc.org/search?query=%28PUBLISHER:%22Public+Library+of+Science%22%29&page=1.

To get all PLOS papers from 2015 you would do:

--query '(PUBLISHER:"Public Library of Science") AND (FIRST_PDATE:[2015-01-01 TO 2015-12-31])'
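
A complete invocation might then look something like this (the output directory name is just an example, and -x/-p should request the fulltext XML and PDFs where available - check the getpapers docs for the exact flags):

getpapers --query '(PUBLISHER:"Public Library of Science") AND (FIRST_PDATE:[2015-01-01 TO 2015-12-31])' --outdir plos-2015 -x -p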

robintw (Author) commented Apr 29, 2016

Ah that's great - thank you. I have that running on my server now :-)

It'd still be good to work out what is going on with quickscrape at some point, as most of the journals I'm trying to scrape aren't available as easily as PLOS. I just have no idea where to start with the debugging; maybe I need to drop print statements everywhere in the code and see where it hangs.
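
One possible starting point, assuming a Linux host and without touching the code: attach strace to the hung process and look at which system call it is blocked in (a read on a network socket, say, would point at a request that never completes):

# attach to the hung quickscrape process (assumes only one is running)
strace -f -p "$(pgrep -f quickscrape)"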

robintw (Author) commented Apr 29, 2016

Also, I'm struggling to download very large numbers of papers with getpapers (I've commented on an issue and think I may have found a workaround), so I'm intrigued: what was the bulk archive you mentioned?

blahah (Member) commented Apr 29, 2016

I don't have time to debug today, I'm afraid. If you go to the PubMed FTP, you can find a bunch of archives called A-C...tar.gz and so on. The one whose range covers P will contain all the PLOS papers, one archive per journal.

petermr (Member) commented May 2, 2016

The following URL hangs quickscrape:

http://www.tandfonline.com/doi/full/10.13039/501100005071

It's an unresolvable URL ("The requested article is not currently available on this site."), but quickscrape should time out and move on instead of hanging.
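
In the meantime, a shell-level workaround might help (a sketch, assuming GNU coreutils timeout, one URL per line in the list file, and quickscrape's -u flag for a single URL; the file names are placeholders): run quickscrape once per URL and kill any run that exceeds a limit, so a hanging URL only costs that one article.

# give each URL five minutes, then kill it and record the failure
while read -r url; do
  timeout 300 quickscrape -u "$url" -s plos.json -o out/ || echo "$url" >> failed.txt
done < urls.txt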
