Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite rbx_xml to use quick_xml instead of xml-rs #280

Closed
wants to merge 111 commits into from

Conversation

Dekkonot
Copy link
Member

@Dekkonot Dekkonot commented Jun 2, 2023

rbx-xml is slow and has trouble processing unicode, among other issues. This can and is solved by simply swapping to a new XML parser. Given the speed, I've decided to go with quick-xml.

Unfortunately, the current parser (xml-rs) is intertwined rather thoroughly with rbx-xml's implementation and swapping it out for quick-xml would be impossible without refactoring large segments of it. Given there's a few other design choices that I wanted to change, I decided to bite the bullet and go for a full rewrite. Initial testing is good and suggests we're looking at 15x the parsing speed in some cases. I have tried to prioritize not copying data when possible, so overall memory should also be low.

Update: This is mostly done now, I believe. Many many tests need to be written but the bulk of the code will not change. With that in mind, here are my thoughts beyond "reviewing this is a nightmare":

  1. This rewrite fundamentally changes how the crate is laid out. Specifically, it's split serializer and deserializer into dedicated modules and the code for (de)serializing data types now lives under those modules. This differs dramatically from the current system we use that implements XmlType for each type and both deserialization and serialization in one place. In my opinion, this is easier to reason with but it is significantly different and may require us to point contributors in the right direction if they're used to the old system. Also, it means we have the potential for element name mismatches between serializing and deserializing. Testing will absolutely catch that, but it's something we didn't have to deal with before.

  2. The deserializer now requires the referent attribute to be set on Item elements. I was very surprised to learn it was not before, but as a result this is also a breaking change. I don't think we should care though.


For posterity, here's a checklist of what needs to be done:

  • Visit properties using the ReflectionDatabase
  • Port data types
  • Redo the serializer
  • Run every test like a million times to ensure I didn't break anything

Simplify logic of XmlReader::eat_text
 - BinaryString no longer wrapped in cdata
 - special floats always serialize in uppercase
Serialize -0 as "0" instead of "-0"
Add deserializer tests for PhysicalProperties
Maintains compatibility across new and old version
@Dekkonot
Copy link
Member Author

Dekkonot commented Oct 4, 2023

Rather than work out what work I'm missing since I started this rewrite, I'm going to close this and instead focus on a less invasive rewrite. I think it will result in a much faster turn around for the port, and in turn it should end up being easier to review. Sad to see my work go, but it will not die here.

@Dekkonot Dekkonot closed this Oct 4, 2023
@Dekkonot Dekkonot deleted the quick-xml branch July 20, 2024 20:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants