-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How does one calculate split/buffered input hash? #28
Comments
I think a hash function does not promise you that. |
So how do you calculate hash of "big data"? If I wanted to calculate hash of some bigger data, like files I have to buffer them and calculate hash of buffer, then next and next... as far as I understand it in crc32 the intermediate state is hash itself that is fed as seed to next iteration
|
got it! thanks! I will think about that. |
Maybe I expressed it imprecisely and not clearly. It's like adding values. You can add
Assume you have lot of data to hash. If you wanted to hash it in one go it would be like that (
Assume that for practical reasons you have to split them in parts called a, b, c. +========+========+========+ To calculate hash you have to hash part 'a' with initial value = 0 ($istate0), save intermediate state ($istate1), fill the buffer and hash part 'b' with initial value $istate1, save intermediate state ($istate2), fill the buffer and hash part 'c' with initial value $istate2. That would give you result for whole set of data.
You can do that as many times as you want until you process all the data. But to be able to do that you need to have some kind of intermediate value that allows you to start where you stopped off. This intermediate state can be hash itself that is becoming seed in next iteration od some other "structure" that save place where you stopped off. |
Presumably that works now, with the seed also being the current state. As long as you use the same block size, you should get the same result each time, with the caveat that the result won't be the same as if you didn't buffer it - I assume this last part if your problem? |
See my first post. That's exactly what I did and it didn't work. If that doesn't work then what does? So there are only two options:
If you see what's wrong then tell me.
Also I made my test program more general and it can use any slice you want now. Presumably it stil doesn't work like that. I'm actually surprised why didn't I take a look into <wyhash.h>. It would give me an answer quicker. Wyhash works like that (pseudocode):
Just this one line: My guess is if one reverses I don't get exact mathematical side of this hash and I'm talking as regular programmer here, person who would like to use a library function with external interface, without going into details. So if you tell me how to |
I had read your comment, but did find it a little unclear. My point is that you can technically today hash a file/stream piece by piece, but the result won't be the same as if you did all the data at once. As long as you use the same block size, the result will be repeatable (but not the same as the test vectors!). To get the same result, you need to process each 32-byte block, and only then transform/finalise the final block. If I remember, I'll take a stab at it with my c# fork later on 👍 |
So, my approach works. I've created a gist to demonstrate. The idea is as above - you hash block by block, only transforming/computing after the final block. I wrote the code in around 3 minutes, so it's a mess and likely doesn't work for all input lengths - but it demonstrates the principal 😄 Probably best to start at the test file to make sense of it. I've also created an issue for this over on the repo of my C# fork, as this would be really useful. |
If you don't get same result for same data then what is the point os calculating hash? You can equally well use random number with the same result. Few things. I don't have nor am intending to have cs compiler. I don't know language anyways. So I cannot test your |
That's just silly. You will get the same result as long as you use the same block size. So if this is for internal use, it works. I'm just trying to help out if you have an immediate need.
Yes, it's essentially the same as the original c version, just a different public API (yeah, in my gist it's terrible, but I'll iterate on that!). I added comments to the tests to try to explain, but as I mentioned, I didn't spend any time on it, so go easy 😅.
I'm no cryptographer, I just have a keen interest in the field; the approach I took is tried and tested, but I absolutely defer to 王一前辈 (WangYi) with regards to any refactoring of the original c source to enable hashing in chunks!
Ouch, that's a bit harsh, IMO 😄. Wyhash is a young hash, and particularly well suited to hashing small keys. But of course I do agree that the ability to hash streams is extremely useful. |
It's not. Till yesterday I didn't know there are hashes that cannot split the data. Silly me.
I'm not a cryptographer either. My math skills are good but not good enough to do that. My interest in this come from sheer interest in hashes as recently I tried to improve string comparison program and hashes proved to be the answer. Enough to say that
Sorry, I'm not good in beating around the bush sometimes. When it comes to speed i made small experiment: the results I got are somewhat disappointing, especially after smahasher claims that wyhash is much faster than xxhash.
Feel free to test it and say if anything is wrong with that. |
Sorry to the delay as I am busy recently. |
@wangyi-fudan personally, I think this is right for the "reference implementation", as long as the design is such that stream hashing is possible (which it certainly is for v1; I haven't looked at the latest version yet). For real-world development, I'd prefer to use implementations with documentation and dev-friendly APIs - these exist already at least for C#, Rust and Go. I haven't written any real C or C++ for a decade, but I might even have a crack at one of these myself if I find time. |
What is intermediate state of hash in case of split/buffered input? I thought it was seed and I used it as has intermediate hash for every slice. But I made a simple test and failed on first sliced input (at second iteration). Assume I get a date which is a string "abc" and split it in 1-byte slices then is it:
wyhash("abc", len, seed) == wyhash("c", len_slice, wyhash("b", len_slice, wyhash("a", len_slice, seed)))
?my output is like this ($xx means seed):
#reference: seed: $0000000000000000
"a" : 31db9c4e34072a5f
"ab" : 8bb082c4c3cc236e
"abc" : e3db0f558c63ddee
"a" : 31db9c4e34072a5f $0000000000000000
---- summary #a ----
"a" : 31db9c4e34072a5f/31db9c4e34072a5f (SUCCEEDED) // this is ok
"ab" : 31db9c4e34072a5f $0000000000000000 // first iteration is ok
"b" : 8efeffb8472c70e0 $31db9c4e34072a5f // second is already not
---- summary #ab ----
"ab" : 8efeffb8472c70e0/8bb082c4c3cc236e (FAILED)
"abc" : 31db9c4e34072a5f $0000000000000000
"bc" : 8efeffb8472c70e0 $31db9c4e34072a5f
"c" : e17fff3caab4cb76 $8efeffb8472c70e0
---- summary #abc ----
"abc" : e17fff3caab4cb76/e3db0f558c63ddee (FAILED)
If this is right then what is wrong with this approach or code. If it's wrong then what is the right way?
I attached my test programs and their results.
wyhash_test-1abc-issue#28.tar.gz
The text was updated successfully, but these errors were encountered: