Limit the parsing depth of the html parsing to avoid out of memory situations #71

GoogleCodeExporter · 2015-07-24T15:48:38Z

What steps will reproduce the problem?

(using ver. 1.2.0)
1. Parse the HTML of "http://worldwidescience.org/topicpages/s.html" with 
boilerpipe. ArticleExtractor is sufficient for demonstration purposes.

Even with 8 GB of JVM memory, this results in an out-of-memory error.

Attached is a patch that allows limiting the number of TextBlocks 
created/appended by boilerpipe. Once that limit is reached, boilerpipe 
ignores all further content from the parsed input.
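
The idea behind the patch can be illustrated with a minimal sketch. This is not the attached patch and does not use boilerpipe's actual classes: `BoundedBlockCollector`, `offer`, and `maxBlocks` are hypothetical names, and plain strings stand in for TextBlocks. It only shows the general technique of capping how many blocks are kept and dropping everything after the cap.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not part of boilerpipe): keeps at most maxBlocks
// text blocks and silently drops everything offered after that.
final class BoundedBlockCollector {
    private final int maxBlocks;
    private final List<String> blocks = new ArrayList<>();
    private boolean limitReached = false;

    BoundedBlockCollector(int maxBlocks) {
        if (maxBlocks < 0) {
            throw new IllegalArgumentException("maxBlocks must be >= 0");
        }
        this.maxBlocks = maxBlocks;
    }

    // Returns true if the block was stored, false if it was dropped
    // because the limit had already been reached.
    boolean offer(String blockText) {
        if (blocks.size() >= maxBlocks) {
            limitReached = true;
            return false;
        }
        blocks.add(blockText);
        return true;
    }

    // True once at least one block has been dropped.
    boolean isLimitReached() {
        return limitReached;
    }

    // Read-only view of the blocks that were kept.
    List<String> blocks() {
        return Collections.unmodifiableList(blocks);
    }
}
```

In a parser, the handler that produces blocks could check the return value of `offer` and skip further work once it returns false, which bounds memory use by the number of retained blocks rather than by the size of the input.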

Original issue reported on code.google.com by [email protected] on 25 Nov 2013 at 4:29

Attachments:

boilerpipe-core.patch.tar.gz

GoogleCodeExporter · 2015-07-24T15:48:38Z

Please change type to "enhancement"

Original comment by [email protected] on 26 Nov 2013 at 8:13

GoogleCodeExporter added Priority-Medium Type-Defect auto-migrated labels Jul 24, 2015