Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Suggestion] Add tokens to tree construction tests #100

Open
RReverser opened this issue Jul 21, 2017 · 6 comments
Open

[Suggestion] Add tokens to tree construction tests #100

RReverser opened this issue Jul 21, 2017 · 6 comments

Comments

@RReverser
Copy link
Contributor

RReverser commented Jul 21, 2017

Motivation: Some HTML parsers (e.g. parse5 and our internal parser) provide a streaming mode in which tokenizer works as if it's executed together with tree construction algorithm, and so tokenizer states are correctly adjusted on certain tags.

Such adjustment is necessary for tokenizer to correctly tokenize contents of <script>, <textarea> and other special tags while still allowing streaming pass-through without the overhead of tree construction which might also need to buffer content to move nodes around etc.

Problem: For now, tests for such "tokenizer + parser feedback" combination have to be auto-generated at build time by running tree construction tests in a full parser and observing tokens produced (see https://github.com/inikulin/parse5/blob/master/scripts/generate_parser_feedback_test/index.js for an example). This is potentially error-prone and hardly reusable between different parsers (currently you need to copy the produced tests around).

Suggestion: add tokens property to the existing tree construction tests that would contain adjusted tokens as they should be seen after changing states as per HTML5 spec.


I'm happy to prepare PR myself if this proposal is accepted as it would greatly benefit existing and future streaming HTML parsers, and also, I think, can be used by regular parsers for extra checks.

@inikulin
Copy link
Member

Browsers can benefit from this too, AFAIK most browsers have preparsers that are used for content pre-load. With this addition there will be a convenient way to test it.

@RReverser The only concern I have is that it will be way to hard to add new tests written by hand. Can we have some tool that generates token list for given markup?

@RReverser
Copy link
Contributor Author

RReverser commented Jul 21, 2017

You mean for literally new tests? I guess writing them by hand shouldn't be much worse than writing a tree itself or any other tokenizer tests.

@gsnedders
Copy link
Member

Really if we do this, we could probably eliminate the tokenizer tests (and really, they're not that great: there's no way to run them against some implementations, like browsers, easily, and hence AFAIK no browser vendor runs them). At the same time, we can't just programmatically generate trees using some implementation because for some the interesting part of the test is something the tree constructor will act like it was there anyway (e.g., <p>foo</p> v. <p>foo)… that said, if we just replaced all known tags with <fake-tag> we should still test everything, I think?

@RReverser
Copy link
Contributor Author

RReverser commented Jul 21, 2017

At the same time, we can't just programmatically generate trees

Well we're not talking about trees, but about pure lists of tokens (unless you mean something else), which should still contain only actual tokens from the source, in the same order, with the only difference being that the tokenizer states were adjusted.

For example, this is how one test generated by mentioned parse5 script currently looks like:

{
    "description": "<script><div></script></div><title><p></title><p><p>",
    "input": "<script><div></script></div><title><p></title><p><p>",
    "output": [
        [
            "StartTag",
            "script",
            {}
        ],
        [
            "Character",
            "<div>"
        ],
        [
            "EndTag",
            "script"
        ],
        [
            "EndTag",
            "div"
        ],
        [
            "StartTag",
            "title",
            {}
        ],
        [
            "Character",
            "<p>"
        ],
        [
            "EndTag",
            "title"
        ],
        [
            "StartTag",
            "p",
            {}
        ],
        [
            "StartTag",
            "p",
            {}
        ]
    ]
},

@RReverser
Copy link
Contributor Author

Really if we do this, we could probably eliminate the tokenizer tests (and really, they're not that great: there's no way to run them against some implementations, like browsers, easily, and hence AFAIK no browser vendor runs them).

I guess... but for the start, would you accept a PR that adds tokens to tree-construction tests, and then we can discuss tokenizer tests separately?

@gsnedders
Copy link
Member

gsnedders commented Aug 10, 2017

At the same time, we can't just programmatically generate trees

Well we're not talking about trees, but about pure lists of tokens (unless you mean something else), which should still contain only actual tokens from the source, in the same order, with the only difference being that the tokenizer states were adjusted.

I meant for the tokenizer tests.

I guess... but for the start, would you accept a PR that adds tokens to tree-construction tests, and then we can discuss tokenizer tests separately?

Yes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants