Limit the parsing depth of the HTML parser to avoid out-of-memory situations #71

Open
GoogleCodeExporter opened this issue Jul 24, 2015 · 1 comment

Comments

@GoogleCodeExporter

What steps will reproduce the problem?

(using ver. 1.2.0)
1. Parse the HTML of "http://worldwidescience.org/topicpages/s.html". ArticleExtractor is sufficient for demonstration purposes.

Even with 8 GB of JVM memory, this results in an out-of-memory error.

Attached is a patch that allows limiting the number of TextBlocks created/appended by boilerpipe. Once that limit is reached, boilerpipe ignores all further content from the parsed input.
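
To illustrate the idea for readers without the attachment, here is a minimal sketch of such a cap. It is not the attached patch and does not use boilerpipe's classes: `BoundedBlockCollector`, `offer`, and `maxBlocks` are hypothetical names, and blocks are modeled as plain strings rather than real TextBlocks. The point is only that once the limit is hit, further blocks are dropped instead of accumulating in memory.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical collector that stops accepting blocks after a fixed limit.
 * Illustrative only; this is not part of boilerpipe.
 */
final class BoundedBlockCollector {
    private final int maxBlocks;
    private final List<String> blocks = new ArrayList<>();
    private boolean limitReached = false;

    BoundedBlockCollector(int maxBlocks) {
        if (maxBlocks < 0) {
            throw new IllegalArgumentException("maxBlocks must be >= 0");
        }
        this.maxBlocks = maxBlocks;
    }

    /**
     * Stores the block if the limit has not been reached.
     * Returns false (and remembers that the limit was hit) otherwise.
     */
    boolean offer(String blockText) {
        if (blocks.size() >= maxBlocks) {
            limitReached = true;
            return false;
        }
        blocks.add(blockText);
        return true;
    }

    /** True once at least one block has been dropped. */
    boolean isLimitReached() {
        return limitReached;
    }

    /** Read-only view of the blocks that were kept. */
    List<String> blocks() {
        return Collections.unmodifiableList(blocks);
    }
}
```

A parser callback would call `offer` for each block it produces and could stop doing further work as soon as it returns false. Whether the real patch truncates silently or reports the cutoff is not stated here, so `isLimitReached` is just one possible way to surface it.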

Original issue reported on code.google.com by [email protected] on 25 Nov 2013 at 4:29

Attachments:

@GoogleCodeExporter

Please change type to "enhancement"

Original comment by [email protected] on 26 Nov 2013 at 8:13
