Is this valid for index size reduction? #98

Open
waddyano opened this issue Feb 18, 2022 · 5 comments

@waddyano

This might really only apply to ice, but I thought it would be better to ask here.

I have been experimenting with using bluge for indexing text files. The main fields are indexed but not stored. The index seems rather large to me, so I have been looking at ways to reduce its size.

After cobbling together some code to print which data consumed the space, I found that the bulk of it by far is the Location lists in the posting data.

Looking at what the processing generated, I tried storing all location information as deltas: each end is stored as an offset from its start, and each start (except the first) as an offset from the previous end. Everything seems to stay in increasing sequence and is always processed in sequence, so this seems to work fine. This is per location list. Since everything is a varint, smaller integers need less space, at the cost of a little arithmetic during read.
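For readers following along, here is a minimal sketch of the delta scheme described above. It is not the code from my ice changes: the `Location` type and the `encodeDeltas`, `decodeDeltas`, and `encodeAbsolute` functions are hypothetical names, and a location is reduced to just a start and end offset. It only assumes the list is sorted and non-overlapping, and uses the standard `encoding/binary` uvarint helpers.

```go
package main

import (
	"encoding/binary"
	"errors"
)

// Location is a simplified, hypothetical stand-in for one term occurrence:
// only the start and end offsets are modelled here.
type Location struct {
	Start uint64
	End   uint64
}

// encodeDeltas writes each location as two uvarints: the gap from the
// previous End to this Start, then the length End-Start. Because the
// previous End starts at 0, the first Start is effectively stored as-is.
// The list must be sorted and non-overlapping so no delta is negative.
func encodeDeltas(locs []Location) ([]byte, error) {
	buf := make([]byte, 0, len(locs)*2)
	tmp := make([]byte, binary.MaxVarintLen64)
	var prevEnd uint64
	for _, l := range locs {
		if l.Start < prevEnd || l.End < l.Start {
			return nil, errors.New("locations must be sorted and non-overlapping")
		}
		n := binary.PutUvarint(tmp, l.Start-prevEnd)
		buf = append(buf, tmp[:n]...)
		n = binary.PutUvarint(tmp, l.End-l.Start)
		buf = append(buf, tmp[:n]...)
		prevEnd = l.End
	}
	return buf, nil
}

// decodeDeltas reverses encodeDeltas by carrying a running previous End.
func decodeDeltas(data []byte) ([]Location, error) {
	var locs []Location
	var prevEnd uint64
	for len(data) > 0 {
		gap, n := binary.Uvarint(data)
		if n <= 0 {
			return nil, errors.New("corrupt start delta")
		}
		data = data[n:]
		length, m := binary.Uvarint(data)
		if m <= 0 {
			return nil, errors.New("corrupt length")
		}
		data = data[m:]
		start := prevEnd + gap
		end := start + length
		locs = append(locs, Location{Start: start, End: end})
		prevEnd = end
	}
	return locs, nil
}

// encodeAbsolute is the baseline for comparison: absolute Start and End
// values, each written as a uvarint.
func encodeAbsolute(locs []Location) []byte {
	buf := make([]byte, 0, len(locs)*2)
	tmp := make([]byte, binary.MaxVarintLen64)
	for _, l := range locs {
		n := binary.PutUvarint(tmp, l.Start)
		buf = append(buf, tmp[:n]...)
		n = binary.PutUvarint(tmp, l.End)
		buf = append(buf, tmp[:n]...)
	}
	return buf
}
```

As a small worked case, the three locations (1000, 1005), (1010, 1015), (70000, 70005) take 14 bytes with absolute values and 9 bytes with deltas, because most deltas fit in a single varint byte. The saving on a real index depends on how far apart the occurrences are.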

The second observation was that it didn't seem necessary to store the field number in every location, so I removed it and just picked it up from the dictionary the list belongs to.

I did create a version 2 format and locally have code which can read/write both formats in one module.

So far this seems to work and has reduced my index size by 38%. Is there some case where this will go wrong? And is there interest in me trying to put together a real change for this, or should I just create my own segment plugin?

@mschoch
Member

mschoch commented Feb 18, 2022

Thanks for looking into this. The delta encoding for locations makes sense, but I would want to review the changes to better understand their impact. Regarding the field number, the reason it has to be stored for every location is to support searching composite fields while still being able to remember which original field each location came from. It's a useful feature, but I'm certainly open to ideas to save space wasted in this area.

I would be open to reviewing the changes for the delta encoding; changing the field number may require further discussion.
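To illustrate the composite-field point, here is a rough sketch rather than the actual ice data structures. `SourceField`, `FieldLocation`, `compositeLocations`, and `originFields` are hypothetical names, and the layout is an assumption made for the example. The idea is that one term's location list in a composite field can mix occurrences from several source fields, so the field ID cannot always be inferred from the dictionary the list belongs to.

```go
package main

// SourceField is a hypothetical view of one real field: its ID and the
// positions at which a given term occurs in it.
type SourceField struct {
	ID        uint16
	Positions []uint64
}

// FieldLocation is a hypothetical location record that remembers which
// original field the occurrence came from.
type FieldLocation struct {
	FieldID uint16
	Pos     uint64
}

// compositeLocations builds the location list for one term in a composite
// field by concatenating that term's locations from every source field.
// Each entry keeps its source field ID.
func compositeLocations(sources []SourceField) []FieldLocation {
	var out []FieldLocation
	for _, s := range sources {
		for _, p := range s.Positions {
			out = append(out, FieldLocation{FieldID: s.ID, Pos: p})
		}
	}
	return out
}

// originFields returns the distinct source field IDs in a location list,
// in first-seen order. This only works because every location carries
// its own field ID.
func originFields(locs []FieldLocation) []uint16 {
	seen := make(map[uint16]bool)
	var ids []uint16
	for _, l := range locs {
		if !seen[l.FieldID] {
			seen[l.FieldID] = true
			ids = append(ids, l.FieldID)
		}
	}
	return ids
}
```

If the per-location field ID were dropped, every entry in the merged list would appear to belong to the composite field itself, and a match could no longer be attributed to its original field.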

@waddyano
Author

Thanks for the reply. I will try to put something together for you to look at soon, and also see if I can study composite fields.

@waddyano
Author

Running the tests helped me see the composite problem. Should have done that earlier. Since then I have coped with composite fields and thought of more improvements.

For reference, the current state of my optimizations is in this commit: waddyano/ice@a5ffbee

@mschoch
Member

mschoch commented Feb 22, 2022

@waddyano I took a quick look. I didn't review closely, but the approach looks good from reading the description.

One thing I see is that you have sections of code guarded with a condition like `if Version == 2 {`. We thought that over time this would lead to a code base that is difficult to maintain. Instead, we started with the model that the semantic version major number represents the file format. This lets the blugelabs/ice repo support different versions, each on a branch in the repository, and we can tag and release versions as needed. If we follow this approach, master would only need to support v2 files. If we ever need to update v1, we can branch off and release as needed.

@waddyano
Author

Thanks for the comments. I am used to accepting the extra complexity in exchange for less code duplication, but will adjust.
