-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Node type for Julia syntax #7
Comments
I've been reading a bit more Roslyn stuff, in particular https://github.com/KirillOsenkov/Bliki/wiki/Roslyn-Immutable-Trees and https://www.dotnetcodegeeks.com/2015/05/inside-the-net-compiler-platform-performance-considerations-during-syntax-analysis-speakroslyn.html. It seems to me that they chose this whole green/red node design for memory and performance reasons, but that one really only gets those benefits if one expects the red tree to only be materialized sometimes, and then partially. I am wondering whether that is a) actually ever the case in the LS, and b) still necessary 20 years later? I should say that this has been Zac's position all along :) On a) it seems to me that there is always some analysis that we will want to run after every single edit. In particular, it seems to me that we would want to always run our diagnostics/linting, and that we will always need a full tree for that. So right now I have a hard time seeing any scenario where we would have a fully materialized green tree, but no red tree. If that is so, then this whole two-part design is probably only creating more memory pressure, right? On b), it is probably good to remember that the Roslyn design was probably created about 20 years ago... So this was a design that allowed real-time editing of source code 20 years ago. To me it seems that the complexity of code hasn't really gone up, whereas compute power of course has, so maybe this is just all complexity that is no longer needed? So right now I am tempted to not go with the JuliaSyntax green/red design, but instead just create a new type structure that mimics the Roslyn red tree. If we were, some of the points from my previous post here wouldn't fully apply anymore. But a key question is whether there are other benefits to this red/green design? So far I haven't found any, but maybe others have? |
Another observation: Roslyn's red tree has different types for every kind of syntax node... Have we thought about that? Is there a downside to that? |
A few comments on this. Firstly, I think the design space for the "right tree" design in Julia is wide open, and I don't think JuliaSyntax's One good thing about JuliaSyntax's parser is that it doesn't use any tree type at all: the tree is represented in a token and range array. This is good for low GC pressure during parsing. Given this, one can opt out of I do have a soft spot for
IIUC rust-analyzer has something called In terms of generalizing the JuliaSyntax tree types to fit other use cases, @timholy has already started this with JuliaDebug/Cthulhu.jl#345
I've thought about this but I don't really like it. It makes all children abstractly typed and forced dynamic dispatch in a lot of places where we'd otherwise have static dispatch if we used a single node type like An apparent benefit of having many node types is that each type can have named child accessors specifically for that syntactic construct. Then users don't need to remember what Maybe the only place multiple node types would be really neat would be if we wanted the AST to be extensible with new node types. In a crude sense Another observation related to GC, is that for certain use cases one doesn't need to build pointer-based tree data structure at all. It's possible and efficient to represent the entire thing as a small number of arrays and just use cursors to access nodes rather than a pointer to an individual node |
Yep, that completely resonates. I want to use this issue here to figure out what the right tree type for the language server would look like.
Nor by me ;) So that is why I am tempted to just follow the design/lead of something else. The Roslyn design is clearly motivated to enable exactly the kind of functionality that we want to provide with the LS, and it was built by some of the most experienced folks in that space, so in my mind that makes it a good starting point. In particular the red tree, after all they managed to built an amazing diagnostic, refactoring etc. functionality on top of that, so presumably it is a useful API for that kind of stuff.
Yep, that is really fantastic! I do think that presumably the most useful way to do this is to utilize the parser from JuliaSyntax, but maybe have our own tree node types for the LS in the end.
Yep, that then definitely doesn't make it the right type for the LS.
I feel the same way. I am just wondering what benefit it would have to go through two layers of tree types for the LS... The Roslyn rational is simple and clear: memory and GC pressure. But I just don't see how that applies if one builds a red tree for every green tree as well. Do you have a sense why Rust went for the green/red tree design? Are they building the red trees lazily/on demand?
Julia dynamic dispatch is presumably a lot slower than a vtable call say in C#, right? In that case it would presumably be a really bad idea :) |
Yea I suggest you create a tree type which works for you, and if it makes sense for general use we can incorporate it back into JuliaSyntax at some point. We may need to generalize
I should add, as a caveat, that a link to the green tree is always available (as
I suppose it depends on whether you need to materialize the full tree all at once to do linting. (eg, does linting top-level statements one at a time work?)
I don't know why exactly. But I get a sense that the rust-analyzer people are quite obsessed with performance and compactness of these data structures. So they probably thought carefully about it. One possibility is that typical source code contains a lot of reusable green tree fragments (leaves and near-leaves). So practically this compresses the green tree a fair bit and they don't materialize their red tree. I haven't experimented with this though.
Yes, I think (vague memory!) that the rust-analyzer red tree is more like a cursor onto the green tree data, and the precise lifetime/ownership control which Rust provides means that the cursor generally lives allocation free on the program stack. IIRC rust provides access to nontrivia children by iterating the green tree child list on demand and ignoring trivia. They also have a typed accessor interface for the fields of the various syntax node kinds built on top of this. But this isn't based on dynamic dispatch, it's based on pattern matching the node kind, I think. (Somewhat like I'm suggesting with
Presumably yes. I haven't read the generic function dispatch code in detail though. I think there might be optimizations for common special cases and a lookup cache (but tree traversal with may different node types is likely to thrash the cache). |
In cases where all the dispatch types are concrete it's pretty well optimized; it's when you have to do subtyping that things slow down. But if everything is concrete then it's really hardly any different from an |
Yes I don't intend to change to language-based dynamic dispatch. I think pattern matching is the right way to do dispatch on these kind of data structures, in the sense that it can be both expressive and maximally efficient. The rust tooling agrees :) The main problem right now is that we don't have a neat/standard way to express pattern matching over our trees (writing a pile of @BenChung has used MLStyle for this in SemanticAST.jl, for example, you can see this is very succinct here: https://github.com/BenChung/SemanticAST.jl/blob/master/src/analyzer.jl I think we could have something like that eventually (though user code needing to know the meaning of the child ordering of |
Regarding |
Absolutely! I vaguely remember discussing using ECS for annotating nodes once on Zulip? Cool that you're trying it out :) For attaching metadata sparsely to nodes ECS would be great. If sparsity is typical for analyses it's probably the way to go. If non-sparse, a straightforward set of attribute arrays looked up based on node id would be fine instead. I think this is how the graphs ecosystem tends to treat things? |
The other option is to just embed the annotations/attributes directly in the tree nodes which is the approach we've got with |
I've done some experiments with this now. The fact that Some numbers for base/abstractarray.jl which is a moderately large (~100 kb / 3000 line) Julia file:
We see that fully materializing a green tree (or any tree of immutable nodes with high branching factor) is quite GC friendly in Julia! If we used an ECS (or parallel arrays) for attributes and tagged the nodes with IDs, we'd be able to have this even for trees with attachable metadata. None of this applies to |
I've summarized and recorded some of these thoughts here: |
Oh, this is very interesting! So essentially our situation is completely different from Roslyn: there they get rid of GC and memory pressure by aggressively caching and reusing leave nodes (and a smaller number of non-leave nodes). But for us the leave nodes actually don't end up on the heap in the first place. So that is probably great in terms of GC, but at the same time we presumably use more memory than a Roslyn like design? That is probably a good tradeoff? Thinking about this a bit more, this might on the flipside make a design with more structured node types (say the Roslyn red story where there is a |
Yep :) I don't think we should worry too much more about allocation here - it seems we're in quite a different situation from the C# runtime back when Roslyn was being designed.
Unclear, but I suspect not: if Julia separately allocated each node we would need memory for the type tag pointer - that lives in the 8 bytes of heap memory prior to the node itself. |
JuliaLang/JuliaSyntax.jl#268 should help with this - I generalized |
At the moment JuliaSyntax produces
SyntaxNode
for us, and we use that to detect test items and friends. The question, though, is whether that is the right node type for us? My understanding is that it doesn't fully round-trip, which seems like something we need.I guess one alternative would be to parse into CSTParser.EXPR instead, i.e. reuse our old data structure but use the parsing logic from JuliaSyntax. Not clear to me whether that is a good idea? Would probably be useful if those that have worked more with CSTParser chimed in on that question. Things that to me look not super ideal are: 1) I'd prefer it to be an immutable, 2) I dislike the
meta
field, I think we should move to a world where any state from the linter/semantic analysis is stored outside of the syntax tree, 3) theparent
field also doesn't strike me as ideal, as it means that one can never reuse a child-subtree in a new tree, at least if we move to immutable, 4) havingval
asString
is probably also not ideal? Doesn't that mean a lot of allocations that one could potentially avoid by just using say aSubString
that is based on the full original doc, or something like that? 5) I think one of the design goals ofEXPR
was that it is somewhat similar toExpr
with the idea that at some point it might become the official parser for Julia. I think that goal is no longer relevant, so I'm wondering whether one would actually design the data structure differently today if that goal is no longer?So maybe we should create a new type?
Another consideration is how this plays with the JuliaFormatter types, my understanding is that it uses yet another node type internally, somehow? Maybe there is an opportunity to design one node type that can power both JuliaWorkspace and JuliaFormatter?
The text was updated successfully, but these errors were encountered: