Support for HTML 5 #282

Open
andreasrosdal opened this issue Mar 8, 2024 · 5 comments

andreasrosdal commented Mar 8, 2024

FS should support HTML 5.

To update the flyingsaucerproject/flyingsaucer library for essential HTML5 support, focus on the key areas that matter most for modern web document standards (ChatGPT suggestions):

  1. HTML5 Parsing: Integrate an HTML5-compliant parser to accurately handle HTML5 documents. This is crucial for recognizing new semantic elements and properly parsing the document structure.

  2. CSS3 Enhancements: Update the CSS rendering engine to support important CSS3 features such as flexbox for layout, media queries for responsive design, and transitions for visual effects. These are foundational for modern web design practices.

  3. Semantic Elements Support: Specifically target support for new semantic elements like <article>, <section>, <nav>, <header>, <footer>, and <figure>. Ensuring these elements are correctly interpreted and rendered is essential for modern web documents.

  4. Form Controls and Input Types: Enhance support for the new form elements and input types introduced in HTML5. This includes types like email, date, range, and color, which are increasingly used in web forms.

  5. JavaScript Interface: Since HTML5 relies on JavaScript for dynamic content, consider how flyingsaucer might either interface with JavaScript or provide hooks for external JavaScript interaction, especially for form validation and handling new input types.

  6. Test Suite for HTML5: Develop a targeted test suite focusing on HTML5 features to ensure compatibility and adherence to standards. Utilize parts of the W3C HTML5 Test Suite for comprehensive coverage.

  7. Documentation and Modular Approach: Update documentation to reflect the support for HTML5 and consider a modular approach for HTML5 features, allowing users to enable specific functionalities as needed. This strategy helps in managing performance implications and maintains backward compatibility.

By concentrating on these aspects, flyingsaucer can significantly improve its HTML5 support, aligning it with current web standards and enhancing its utility for modern web document rendering.

Integrating an HTML5-compliant parser into the flyingsaucerproject/flyingsaucer library involves several detailed steps to ensure accurate handling of HTML5 documents. These steps are crucial for recognizing new semantic elements and properly parsing the document structure:

  1. Evaluate Existing Parser: Assess the capabilities and limitations of the current parsing mechanism in flyingsaucer to understand how it handles HTML and where it falls short with HTML5 content.

  2. Select an HTML5 Parser: Choose an HTML5-compliant parser that can be integrated into flyingsaucer. Popular Java libraries such as jsoup or HtmlUnit have strong support for HTML5 and offer a good balance between performance and ease of use.
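
As a rough way to approach step 1, the sketch below checks whether a piece of markup survives a strict XML parse. It uses the JDK's built-in `DocumentBuilder` as a stand-in for a strict XML parser; that it behaves like Flying Saucer's current parsing is an assumption, and the names `StrictParserCheck` and `parsesAsXml` are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Flying Saucer.
class StrictParserCheck {

    /** Returns true if the JDK's XML parser accepts the markup as well-formed XML. */
    public static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The markup is not well-formed XML.
            return false;
        } catch (Exception e) {
            // Parser setup or I/O problems are unrelated to the markup itself.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>Hello<br/></p></body></html>";
        // Typical HTML5: void <br> without a slash and an unclosed <p>.
        String html5 = "<!DOCTYPE html><html><body><p>Hello<br><p>World</body></html>";

        System.out.println("XHTML sample accepted: " + parsesAsXml(xhtml));
        System.out.println("HTML5 sample accepted: " + parsesAsXml(html5));
    }
}
```

Running a collection of real HTML5 pages through a check like this would give a first picture of which constructs (void elements, omitted end tags, named entities such as `&nbsp;`) a replacement parser has to handle.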

https://www.w3.org/TR/2011/WD-html5-20110405/
https://html.spec.whatwg.org/

Possibly some implementation details can be copied from:
https://github.com/openhtmltopdf/openhtmltopdf/


rbri commented Mar 19, 2024

Maybe https://github.com/HtmlUnit/htmlunit-neko can help here.
It

  • parses HTML into a (valid) DOM tree
  • has no other dependencies
  • is currently used by HtmlUnit and therefore in many projects
  • is actively developed

Because my time is limited I can't provide an implementation, but I will support this if you like...


rbri commented May 27, 2024

Hey folks, I did some minor experiments...

  • loaded the whole monster into Eclipse
  • made a run configuration for the browser sample
    image
  • changed the environment to point to the htmlunit-neko SAX parser
    image
  • added htmlunit-neko to the classpath
    image

Then I started the browser and pointed it to a plain HTML page
image

I think this is not that bad compared to
image
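
The screenshots of the configuration change are not preserved in this text, so the exact setting is unknown. As a generic sketch of the idea (choosing a SAX `XMLReader` implementation by class name at startup), something like the following could work. The property name `demo.xml-reader` and the class `ReaderSelector` are hypothetical and are not Flying Saucer's actual configuration.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

// Hypothetical helper, not part of Flying Saucer.
class ReaderSelector {

    /** Hypothetical system property naming the XMLReader class to use. */
    public static final String READER_PROPERTY = "demo.xml-reader";

    /**
     * Creates the XMLReader named by the system property, or the JDK's
     * default SAX reader when the property is unset or empty.
     */
    public static XMLReader createReader() {
        String className = System.getProperty(READER_PROPERTY);
        try {
            if (className == null || className.isEmpty()) {
                SAXParserFactory factory = SAXParserFactory.newInstance();
                factory.setNamespaceAware(true);
                return factory.newSAXParser().getXMLReader();
            }
            // Any class on the classpath that implements XMLReader and has a
            // public no-argument constructor can be plugged in here.
            Class<?> type = Class.forName(className);
            return (XMLReader) type.getDeclaredConstructor().newInstance();
        } catch (Exception e) {
            throw new IllegalStateException("Cannot create XMLReader: " + className, e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Using reader: " + createReader().getClass().getName());
    }
}
```

With a lenient HTML parser on the classpath, starting the JVM with `-Ddemo.xml-reader=<its XMLReader class name>` would swap the reader without touching the calling code.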


rbri commented May 27, 2024

Because neko fixes many issues found in real-world documents, I was also able to open https://www.htmlunit.org/

Before:
image

After:
image

@andreasrosdal

@rbri I would like to encourage you to make a pull request that allows using the htmlunit-neko HTML parser in Flying Saucer by default, without any configuration hassle, because this HTML parser is clearly better than the current XML SAX parser. This would make it much easier to recommend Flying Saucer to the developers at the company where I work. At the moment FS is not a good fit because it only supports strict XHTML, and developers are now looking for alternatives to FS.

rbri added a commit to rbri/flyingsaucer that referenced this issue Jun 2, 2024

rbri commented Jun 2, 2024

@andreasrosdal PR is there ;-) I guess we need some discussion about the right way to do it (maybe a service and a different subproject?)
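
One possible reading of the "service" idea is Java's `ServiceLoader` mechanism: the core asks for a parser provider, and an optional subproject registers one. The sketch below is only an illustration of that pattern under this assumption; `XmlReaderProvider` and `ReaderLookup` are hypothetical names, not existing Flying Saucer types, and the PR may take a different route.

```java
import java.util.Iterator;
import java.util.ServiceLoader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

/** Hypothetical extension point that an optional subproject could implement. */
interface XmlReaderProvider {
    XMLReader createReader();
}

// Hypothetical lookup helper for the core module.
class ReaderLookup {

    /**
     * Returns a reader from the first provider registered via
     * META-INF/services, or the JDK's default SAX reader if none is found.
     */
    public static XMLReader findReader() {
        Iterator<XmlReaderProvider> providers =
                ServiceLoader.load(XmlReaderProvider.class).iterator();
        if (providers.hasNext()) {
            return providers.next().createReader();
        }
        try {
            // No provider on the classpath: keep the strict XML behaviour.
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("No XMLReader available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Reader in use: " + findReader().getClass().getName());
    }
}
```

With this layout, adding the optional subproject's jar to the classpath would be enough to switch parsers, and projects that do not add it would keep the current behaviour.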
