Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XPath issues when doc goes out of scope #61

Open
jangernert opened this issue Nov 18, 2019 · 5 comments
Open

XPath issues when doc goes out of scope #61

jangernert opened this issue Nov 18, 2019 · 5 comments
Labels

Comments

@jangernert
Copy link
Contributor

jangernert commented Nov 18, 2019

On my way debugging and separating the issues listed in #60 I came across another thing that isn't related to async or threading.

An xpath context is non-functional after the document from which it was created goes out of scope and is dropped.
There is no warning or error. It just doesn't return any results any more.

Here is a small example. The parse_html function causes the xpath evaluation to return an empty vector, but no unwrap fails. Doing the same thing in the unit test itself (the commented out code) works as expected:

#[cfg(test)]
mod tests {
    use std::fs::File;
    use std::io::Read;
    use libxml::parser::Parser;
    use libxml::xpath::Context;
    use libxml::tree::Node;

    #[test]
    fn test1() {
        let mut file = File::open("html.txt").unwrap();
        let mut html = String::new();
        file.read_to_string(&mut html).unwrap();

        // let parser = Parser::default_html();
        // let doc = parser.parse_string(html).unwrap();
        // let ctx = Context::new(&doc).unwrap();
        
        let ctx = parse_html(&html);
        
        let res = evaluate_xpath(&ctx, "//article/header");
        assert_eq!(res.len(), 1);
    }

    fn parse_html(html: &str) -> Context {
        let parser = Parser::default_html();
        let doc = parser.parse_string(html).unwrap();
        Context::new(&doc).unwrap()
    }

    fn evaluate_xpath(ctx: &Context, xpath: &str) -> Vec<Node> {
        let res = ctx.evaluate(xpath).unwrap();
        res.get_nodes_as_vec()
    }
}

I see that the XPath context does contain a weak reference to the document to prevent exactly this issue. But from my understanding the context needs a strong reference to extend the lifetime of the document.

Weak documentation:

Since a Weak reference does not count towards ownership, it will not prevent the inner value from being dropped, and Weak itself makes no guarantees about the value still being present and may return None when upgraded.

Not sure if I understand everything correctly, but the Document implementing drop directly and freeing the doc pointer seems wrong to me. Shouldn't the Document just drop its reference to the _Document and only when all references to _Document are dropped _Document itself calls xmlFreeDoc
-> oops, misread the code. It works exactly as I just explained. Guess it's a bit confusing that Document is declared in L82 and right after that the drop trait is implemented for _Document which is declared before Document.

@dginev dginev added the bug label Nov 18, 2019
@jangernert
Copy link
Contributor Author

jangernert commented Nov 18, 2019

@dginev
I created a branch that fixes the problem for me: jangernert@5c9591d
There are only 2 questions:

  • does this possibly mess up something else?
  • should this also be done for the objects that are the results of an xpath evaluation? The document reference gets upgraded when the actual nodes are requested from the object anyway. The object would be the only link in the chain that only holds a weak reference to the document. The document going out of scope before calling get_nodes_as_vec would mean a panic. Because the upgrading of the reference fails which is just handled by a simple unwrap in the code.

@dginev
Copy link
Member

dginev commented Nov 18, 2019

@jangernert I need to find some time to look into this more closely. My first instinct is that I would like to keep the Context object using DocumentWeak, and solve the problem differently.

The reason I started using Weak is to avoid getting into a very complex reference-counting model for the wrappers over the C data structures. The idea being that if you are using xpath over a document, it is the author's responsibility to execute it while the document is in scope, or suffer a runtime panic otherwise. Holding a strong reference leaves a lot of room open for subtle memory leaks - especially in programs which process N documents with M xpaths, and - at least to my abilities - it was a lot harder to reason about that model, and to fish out the memory leaks. Finding that a document is not in scope with a runtime panic on the other hand is very direct to both understand and fix.

I need to rush now, but I'd like to come back and understand better the exact problem you are seeing - do you have a use case where you want the xpath to outlive its document? Or would you be happy with a very informative error message when the said panic takes place?

@dginev
Copy link
Member

dginev commented Nov 18, 2019

Here is the PR that introduced the DocumentWeak treatment with a little more details around the memory leaks I was running into prior:

#42 (comment)

@jangernert
Copy link
Contributor Author

jangernert commented Nov 18, 2019

A runtime error would be sufficient I guess. If with some rust magic that error could be moved to compile time I would be even happier.

With the behaviour as it is, it is quite easy to not notice the problem for a long time. Apparently I already encountered it before as instead of functions for some reason I wrote parse_html in my actual crate as a macro that kept the doc alive.
Looking at the code today, that made no sense to me any more and I encountered the bug once again :D

edit:

when the said panic takes place?

Just to be clear: the code example above does not panic for me. That's why I ran into this bug twice already.

Thinking about it: maybe requiring a reference to a document when calling evaluate and similar functions on the context? The compiler wouldn't ensure passing the document the context was created from, but at least it would make things clearer. There could still be a runtime error if the wrong document was passed to it. For a runtime check of the weak reference it would have to be upgraded anyway, right?

@faassen
Copy link

faassen commented Nov 29, 2022

I think I just ran into this one: xpath seemed to work, but the symptom I had is that node.get_type() of nodes returned by xpath always returned None (even though 'get_name()` did return stuff). When I found this issue and changed the code to ensure the document stays alive suddenly it was doing what I expected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants