JSoup HTML parser in separate module #391

Closed
andreasrosdal wants to merge 20 commits

@andreasrosdal
Contributor

andreasrosdal commented Sep 21, 2024

No description provided.

@pbrant
Member

pbrant commented Sep 22, 2024

Hey Andreas, thanks for the PR. I appreciate the effort that went into it. I'm afraid it's kind of an example of "hunting mice with an elephant gun" though.

It would be less invasive to add a service interface to allow a user to swap out the DOM parser implementation used by the Swing-based mini-browser (either auto-configured by the presence of the module or explicitly swapped out through configuration). I think I may have suggested this before.

Supporting additional CSS properties is almost entirely orthogonal to the choice of DOM parser. It could be done whether the DOM is created by an XML, XHTML, or HTML5 parser.

We do have some experience with copy-n-pasted modules. The old flying-saucer-pdf-itext5 module was effectively a clone of flying-saucer-pdf with package changes and minor API updates.

To put it bluntly, it was a disaster. It had already bitrotted rather badly by the time it was deleted as most contributed fixes only touched flying-saucer-pdf. I'm quite happy that Andrei had the courage to delete it.

It's awesome that you'd like to start experimenting with supporting more CSS properties and adding JavaScript. It is a hugely ambitious task. I'd suggest starting that effort in a separate fork to see how it goes.

@asolntsev
Contributor

@andreasrosdal @pbrant In fact, we already have an example showing how to use JSoup to parse HTML:

https://github.com/flyingsaucerproject/flyingsaucer/blob/main/flying-saucer-examples/src/test/java/org/xhtmlrenderer/pdf/PdfFromInvalidHtmlTest.java

But yes, we could improve it even more by using a service loader mechanism...
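
For illustration, here is a minimal sketch of what the service-interface idea discussed in this thread could look like using the JDK's `java.util.ServiceLoader`. The names `DomParser`, `JdkXmlDomParser`, and `DomParsers` are hypothetical and do not exist in Flying Saucer. The sketch only assumes that the renderer consumes a `org.w3c.dom.Document`. It covers both options mentioned above: auto-configuration when a provider is present, and explicit replacement through configuration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ServiceLoader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical service interface (an assumption, not an existing Flying Saucer API).
interface DomParser {
    Document parse(String markup);
}

// Fallback provider backed by the JDK's XML parser; it only accepts well-formed markup.
final class JdkXmlDomParser implements DomParser {
    @Override
    public Document parse(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Markup could not be parsed", e);
        }
    }
}

// Hypothetical lookup point that the mini-browser would call instead of
// hard-coding a parser.
final class DomParsers {
    // Explicit configuration; takes precedence over auto-discovery when set.
    private static volatile DomParser override;

    private DomParsers() {
    }

    static void use(DomParser parser) {
        override = parser;
    }

    static DomParser current() {
        DomParser explicit = override;
        if (explicit != null) {
            return explicit;
        }
        // Auto-configuration: pick the first provider registered on the
        // classpath or module path, otherwise fall back to the JDK parser.
        return ServiceLoader.load(DomParser.class)
                .findFirst()
                .orElseGet(JdkXmlDomParser::new);
    }
}
```

With this shape, an optional JSoup-based module would only need to ship one `DomParser` implementation plus a provider registration (a `META-INF/services` file named after the interface, or a `provides` clause in `module-info.java`). The core modules would not need to be copied.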

3 participants