How does one calculate split/buffered input hash? #28

tansy · 2019-07-09T13:44:53Z

What is intermediate state of hash in case of split/buffered input? I thought it was seed and I used it as has intermediate hash for every slice. But I made a simple test and failed on first sliced input (at second iteration). Assume I get a date which is a string "abc" and split it in 1-byte slices then is it:
wyhash("abc", len, seed) == wyhash("c", len_slice, wyhash("b", len_slice, wyhash("a", len_slice, seed))) ?

		seed = _seed;
		cptr = (char*)_data;
		len = _data_len;
		for(j=0; j<len; j++)
			{
			hash = wyhash(cptr+j, 1, seed);
			seed = hash;
			}

my output is like this ($xx means seed):

#reference: seed: $0000000000000000
"a" : 31db9c4e34072a5f
"ab" : 8bb082c4c3cc236e
"abc" : e3db0f558c63ddee

"a" : 31db9c4e34072a5f $0000000000000000
---- summary #a ----
"a" : 31db9c4e34072a5f/31db9c4e34072a5f (SUCCEEDED) // this is ok

"ab" : 31db9c4e34072a5f $0000000000000000 // first iteration is ok
"b" : 8efeffb8472c70e0 $31db9c4e34072a5f // second is already not
---- summary #ab ----
"ab" : 8efeffb8472c70e0/8bb082c4c3cc236e (FAILED)

"abc" : 31db9c4e34072a5f $0000000000000000
"bc" : 8efeffb8472c70e0 $31db9c4e34072a5f
"c" : e17fff3caab4cb76 $8efeffb8472c70e0
---- summary #abc ----
"abc" : e17fff3caab4cb76/e3db0f558c63ddee (FAILED)

If this is right then what is wrong with this approach or code. If it's wrong then what is the right way?

I attached my test programs and their results.
wyhash_test-1abc-issue#28.tar.gz

The text was updated successfully, but these errors were encountered:

wangyi-fudan · 2019-07-09T13:48:16Z

I think a hash function does not promise you that.

tansy · 2019-07-09T15:48:29Z

So how do you calculate hash of "big data"? If I wanted to calculate hash of some bigger data, like files I have to buffer them and calculate hash of buffer, then next and next... as far as I understand it in crc32 the intermediate state is hash itself that is fed as seed to next iteration*, in xxhash it's XXH32_state_t, so what it is in wyhash? Is there any way to calculate hash of data that don't fit in single buffer? If I wanted to calculate hash of Silesia corpus and wasn't able to fit it in single memory buffer** how do I do that?

* for example: CRC32_update_byte()/CRC32_update_buf()
** Same applies to streaming scenario.

wangyi-fudan · 2019-07-09T20:13:06Z

got it! thanks! I will think about that.

tansy · 2019-07-10T00:58:36Z

Maybe I expressed it imprecisely and not clearly.

It's like adding values. You can add y=a+b+c and you can add one part at a time and eventually get whole sum:

istate = 0
  --------------------+
istate += a           |
istate += b           |
istate += c           |
  --------------------+
y = istate

Assume you have lot of data to hash. If you wanted to hash it in one go it would be like that ($ is to show intermediate state of function):

hash(&abc, whole_len, $0000000) == 352441c2

Assume that for practical reasons you have to split them in parts called a, b, c.

+========+========+========+
|....a....................|....b....................|....c....................|
+========+========+========+
^$istate0..........^$istate1..........^$istate2..........^istate3

To calculate hash you have to hash part 'a' with initial value = 0 ($istate0), save intermediate state ($istate1), fill the buffer and hash part 'b' with initial value $istate1, save intermediate state ($istate2), fill the buffer and hash part 'c' with initial value $istate2. That would give you result for whole set of data.

$istate := 0x00000000 // $istate0
hash(&a, slice_len, $0000000) == e8b7be43
$istate := e8b7be43
hash(&b, slice_len, $istate) == 9e83486d
$istate := 9e83486d
hash(&c, slice_len, $istate) == 352441c2
$istate := 352441c2
_end_of_data ?
yes: hash := $istate == 352441c2 // result

You can do that as many times as you want until you process all the data. But to be able to do that you need to have some kind of intermediate value that allows you to start where you stopped off. This intermediate state can be hash itself that is becoming seed in next iteration od some other "structure" that save place where you stopped off.

cocowalla · 2019-07-10T07:17:43Z

Presumably that works now, with the seed also being the current state. As long as you use the same block size, you should get the same result each time, with the caveat that the result won't be the same as if you didn't buffer it - I assume this last part if your problem?

tansy · 2019-07-10T15:17:09Z

See my first post. That's exactly what I did and it didn't work. If that doesn't work then what does? So there are only two options:

it doesn't work like that
there is an error in my program

If you see what's wrong then tell me.
I just checked it once again. Wyhash works correctly with all test vectors (from here).

wyhash("a", 1, 0x0000000000000000); // == 31db9c4e34072a5f; // that's right
wyhash("b", 1, 0x31db9c4e34072a5f); // == 8efeffb8472c70e0 // that's wrong; should be 8bb082c4c3cc236e

Also I made my test program more general and it can use any slice you want now. Presumably it stil doesn't work like that.

I'm actually surprised why didn't I take a look into <wyhash.h>. It would give me an answer quicker.

Wyhash works like that (pseudocode):

wyhash (key, len, seed)
    {
    seed = internal_wyhash(seed, data);
    ;
    return _wymum(seed, len^_wyp5);
    }

Just this one line: return _wymum(seed, len^_wyp5); shows that seed is not a hash although seed is some internal state of wyhash.

My guess is if one reverses _wymum(seed, len^_wyp5) and feeds internal seed to next iteration it might work. Might, because there is seed^=_wyp0; for "unaligned bytes". (They are not literally unaligned as there is switch(len&31) {...} which makes things mixed up)

I don't get exact mathematical side of this hash and I'm talking as regular programmer here, person who would like to use a library function with external interface, without going into details.

So if you tell me how to _un_wymum(seed, len^_wyp5); I may try to do that but it may or may not work. Probably only @wangyi-fudan knows.

cocowalla · 2019-07-10T16:45:44Z

I had read your comment, but did find it a little unclear.

My point is that you can technically today hash a file/stream piece by piece, but the result won't be the same as if you did all the data at once. As long as you use the same block size, the result will be repeatable (but not the same as the test vectors!).

To get the same result, you need to process each 32-byte block, and only then transform/finalise the final block.

If I remember, I'll take a stab at it with my c# fork later on 👍

cocowalla · 2019-07-10T19:51:33Z

So, my approach works. I've created a gist to demonstrate.

The idea is as above - you hash block by block, only transforming/computing after the final block.

I wrote the code in around 3 minutes, so it's a mess and likely doesn't work for all input lengths - but it demonstrates the principal 😄 Probably best to start at the test file to make sense of it.

I've also created an issue for this over on the repo of my C# fork, as this would be really useful.

tansy · 2019-07-10T21:15:59Z

My point is that you can technically today hash a file/stream piece by piece, but the result won't be the same as if you did all the data at once.

If you don't get same result for same data then what is the point os calculating hash? You can equally well use random number with the same result.

Few things. I don't have nor am intending to have cs compiler. I don't know language anyways. So I cannot test your WyHash64Tests.cs. As far as I understand your WyHash.cs is "a bit" different than wyhash.h. You have divided hashing into few steps and that's core of your approach. With wyhash.h it's impossible. Simple as that. Unless I take it apart into pieces and reorganise the cogs to make them do what I want. I respect your skill, I'm not as good but still I'm not going to reinvent the wheel or a food processor for that matter by rewriting the whole algorithm from scratch. That's not what I meant.
The way it is now it's useful to hash strings, and some small data sets and that's all. The whole alleged speed it totally wasted as noone in no real circumstances will utilise it.

cocowalla · 2019-07-10T22:29:30Z

If you don't get same result for same data then what is the point os calculating hash? You can equally well use random number with the same result

That's just silly. You will get the same result as long as you use the same block size. So if this is for internal use, it works. I'm just trying to help out if you have an immediate need.

Few things. I don't have nor am intending to have cs compiler. I don't know language anyways. So I cannot test your WyHash64Tests.cs. As far as I understand your WyHash.cs is "a bit" different than wyhash.h. You have divided hashing into few steps and that's core of your approach.

Yes, it's essentially the same as the original c version, just a different public API (yeah, in my gist it's terrible, but I'll iterate on that!). I added comments to the tests to try to explain, but as I mentioned, I didn't spend any time on it, so go easy 😅.

With wyhash.h it's impossible. Simple as that. Unless I take it apart into pieces and reorganise the cogs to make them do what I want. I respect your skill, I'm not as good but still I'm not going to reinvent the wheel or a food processor for that matter by rewriting the whole algorithm from scratch. That's not what I meant.

I'm no cryptographer, I just have a keen interest in the field; the approach I took is tried and tested, but I absolutely defer to 王一前辈 (WangYi) with regards to any refactoring of the original c source to enable hashing in chunks!

The way it is now it's useful to hash strings, and some small data sets and that's all. The whole alleged speed it totally wasted as noone in no real circumstances will utilise it.

Ouch, that's a bit harsh, IMO 😄. Wyhash is a young hash, and particularly well suited to hashing small keys. But of course I do agree that the ability to hash streams is extremely useful.

tansy · 2019-07-12T14:23:21Z

If you don't get same result for same data then what is the point of calculating hash? You can equally well use random number with the same result

That's just silly. You will get the same result as long as you use the same block size. So if this is for internal use, it works. I'm just trying to help out if you have an immediate need.

It's not. Till yesterday I didn't know there are hashes that cannot split the data. Silly me.

I'm no cryptographer, (...)

I'm not a cryptographer either. My math skills are good but not good enough to do that. My interest in this come from sheer interest in hashes as recently I tried to improve string comparison program and hashes proved to be the answer. Enough to say that strcmp(str1, str2); did the job of (n^2-n)/2 iterations for n=216555 in ~25 min and int i1=wyhash(str1); (...); if (i1==i2){} in ~1.5 min. It's 16 times faster.

The way it is now it's useful to hash strings, and some small data sets and that's all. The whole alleged speed it totally wasted as noone in no real circumstances will utilise it.

Ouch, that's a bit harsh, IMO . Wyhash is a young hash, and particularly well suited to hashing small keys. But of course I do agree that the ability to hash streams is extremely useful.

Sorry, I'm not good in beating around the bush sometimes.
I don't care if it's young or old as long as it does its job. And it does but inability to split the input or in general application and limitations could be mentioned in documentation. Not everyone is wanting or able to read and understand the source. But (almost) everyone should be able to use it as library function.

When it comes to speed i made small experiment:
put silesia.tar in ramdisk (tmpfs) and calculated few hashes: crc32-1 (Sarwate byte-by-byte), crc32-2 (slice-by-8-bytes), xxhash, wyhash (attached; it puts whole file into memory butter and then hashes it wyhash(buffer, sizeof(buffer), 0llu); [wyhash_file-1c.c.gz](https://github.com/wangyi-fudan/wyhash/files/3386849/wyhash_file-1c.c.gz) ).

the results I got are somewhat disappointing, especially after smahasher claims that wyhash is much faster than xxhash.

$ time crc322 silesia.tar
8c7ee36e
real    0m0.299s
# 676 [MiB/s]

$ time xxhsum silesia.tar
a604faf88ddecf47
real    0m0.134s
# 1508 [MiB/s]

$ time wyhash_file-1 silesia.tar
e6b859f26a2ab90b
real    0m0.242s
# 835 [MiB/s]

Feel free to test it and say if anything is wrong with that.

wangyi-fudan · 2019-07-17T23:14:35Z

Sorry to the delay as I am busy recently.
Thanks for your discussions.
I think wyhash is mainly designed for short string hashing. It is not the fastest one for hashing large files.
On my personal way, I use mmap to map large file to memory and do wyhash at once.
Thus I consider to keep the original interface to keep things simple :-)

cocowalla · 2019-07-18T07:10:50Z

@wangyi-fudan personally, I think this is right for the "reference implementation", as long as the design is such that stream hashing is possible (which it certainly is for v1; I haven't looked at the latest version yet).

For real-world development, I'd prefer to use implementations with documentation and dev-friendly APIs - these exist already at least for C#, Rust and Go. I haven't written any real C or C++ for a decade, but I might even have a crack at one of these myself if I find time.

cocowalla mentioned this issue Jul 10, 2019

Incremental hashing cocowalla/wyhash-dotnet#4

Open

wangyi-fudan closed this as completed Jul 17, 2019

cocowalla mentioned this issue Jul 21, 2021

Hashing a file, by chunks #102

Closed

smarchini mentioned this issue Mar 8, 2023

Streaming hash #131

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How does one calculate split/buffered input hash? #28

How does one calculate split/buffered input hash? #28

tansy commented Jul 9, 2019 •

edited

Loading

wangyi-fudan commented Jul 9, 2019

tansy commented Jul 9, 2019 •

edited

Loading

wangyi-fudan commented Jul 9, 2019

tansy commented Jul 10, 2019 •

edited

Loading

cocowalla commented Jul 10, 2019

tansy commented Jul 10, 2019 •

edited

Loading

cocowalla commented Jul 10, 2019 •

edited

Loading

cocowalla commented Jul 10, 2019 •

edited

Loading

tansy commented Jul 10, 2019

cocowalla commented Jul 10, 2019

tansy commented Jul 12, 2019 •

edited

Loading

wangyi-fudan commented Jul 17, 2019

cocowalla commented Jul 18, 2019

How does one calculate split/buffered input hash? #28

How does one calculate split/buffered input hash? #28

Comments

tansy commented Jul 9, 2019 • edited Loading

wangyi-fudan commented Jul 9, 2019

tansy commented Jul 9, 2019 • edited Loading

wangyi-fudan commented Jul 9, 2019

tansy commented Jul 10, 2019 • edited Loading

cocowalla commented Jul 10, 2019

tansy commented Jul 10, 2019 • edited Loading

cocowalla commented Jul 10, 2019 • edited Loading

cocowalla commented Jul 10, 2019 • edited Loading

tansy commented Jul 10, 2019

cocowalla commented Jul 10, 2019

tansy commented Jul 12, 2019 • edited Loading

wangyi-fudan commented Jul 17, 2019

cocowalla commented Jul 18, 2019

tansy commented Jul 9, 2019 •

edited

Loading

tansy commented Jul 9, 2019 •

edited

Loading

tansy commented Jul 10, 2019 •

edited

Loading

tansy commented Jul 10, 2019 •

edited

Loading

cocowalla commented Jul 10, 2019 •

edited

Loading

cocowalla commented Jul 10, 2019 •

edited

Loading

tansy commented Jul 12, 2019 •

edited

Loading