Implement SWH compatible SHA-1s #102

Open
gousiosg opened this issue Nov 3, 2020 · 5 comments

@gousiosg
Contributor

gousiosg commented Nov 3, 2020

Is your feature request related to a problem? Please describe.

We currently compute SHA-1s by calling the SHA-1 function directly on the file contents. To link with SWH's software archive, we need to compute Git-compatible SHA-1s.

Describe the solution you'd like

Add a column to the files table (e.g. swh_sha1) and fill it with Git-style salted SHA-1s. As per @zvr:

Git computes the SHA of a file object (blob) by "salting" (prepending) to its contents the word "blob", a space, the size of the file in decimal and a NUL byte.

$ echo 'Hello, World!' | git hash-object -w --stdin
8ab686eafeb1f44702738c8b0f24f2567c36da6d
$ echo -e 'blob 14\0Hello, World!' | shasum
8ab686eafeb1f44702738c8b0f24f2567c36da6d
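
The same computation can be sketched in a few lines of Python. This is only an illustration of the salting rule quoted above, not code from the project; the function name `git_blob_sha1` is hypothetical.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Return the Git blob SHA-1 of `content` as a hex string.

    Hypothetical helper: it prepends the header "blob <size>\\0",
    where <size> is the content length in bytes written in decimal,
    and hashes header + content with plain SHA-1.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # `echo` appends a newline, so the hashed content is 14 bytes long.
    print(git_blob_sha1(b"Hello, World!\n"))
```

Note that the size in the header counts bytes, not characters, so the input should be the raw file bytes rather than decoded text.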

Additional context

@amottier
Contributor

@gousiosg you mention that we are currently computing SHA1 but in the wiki I found a reference to SHA256.

As I'm not very familiar with the overall FASTEN architecture, I have trouble locating the code that actually computes the SHA.
Can you point me in the right direction so I can double-check whether we are using SHA-1 or SHA-256 and update the wiki if needed?

Thanks.

@proksch
Contributor

proksch commented Mar 1, 2022

Adding a small comment just to raise this issue to the top. Should be an easy fix and should be implemented before the next restart of the whole pipeline.

@proksch
Contributor

proksch commented May 19, 2022

@amottier In our call some weeks ago, Cedric said that he would talk to Stefano to understand the details. I am not sure whether this happened before he left the project. Do you have any further insights, or would you be so kind as to initiate contact with Stefano to understand what exactly is required from our side?

@amottier
Contributor

Documentation of the Python tool (swh-identify) to generate the SWHID: https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#computing
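
For file contents, the identifier described on that page is the Git blob SHA-1 prefixed with `swh:1:cnt:`. A minimal sketch, assuming we only need content (file) identifiers and not directories, revisions, or releases; the function name `swhid_for_content` is hypothetical and this does not use the swh-model library itself:

```python
import hashlib


def swhid_for_content(content: bytes) -> str:
    """Build a content SWHID of the form swh:1:cnt:<git blob sha1>.

    Hypothetical helper for illustration; the hash is SHA-1 over
    the header "blob <size>\\0" followed by the raw file bytes.
    """
    header = b"blob %d\0" % len(content)
    digest = hashlib.sha1(header + content).hexdigest()
    return "swh:1:cnt:" + digest


if __name__ == "__main__":
    print(swhid_for_content(b"Hello, World!\n"))
```

Results from a sketch like this should be cross-checked against swh-identify on a few sample files before anything is stored.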

@mir-am mir-am self-assigned this Aug 1, 2022
@mir-am
Contributor

mir-am commented Aug 1, 2022

@proksch, @amottier
To implement this feature, I looked into the Metadata DB code and have two findings:
1- At the moment, we do NOT insert a checksum into the files table, so its value is always null. Instead of adding a new column, we can store the SWH-compatible SHA-1 in this field.
2- It seems that we do not have access to the contents/source code of the files of a package version. Therefore, it may not be possible to implement what Georgios described above.
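
The second finding matters because the hash depends on both the byte size and the raw bytes of each file, so it cannot be derived from file metadata alone. A minimal sketch of what would be needed if the files were available on disk, assuming a local path per file; the function name `git_blob_sha1_of_file` and the chunk size are hypothetical:

```python
import hashlib
import os


def git_blob_sha1_of_file(path: str, chunk_size: int = 65536) -> str:
    """Compute the Git blob SHA-1 of a file without loading it fully.

    Hypothetical helper: the header needs the size up front, so the
    size is read from the file system before the bytes are streamed.
    """
    size = os.path.getsize(path)
    sha = hashlib.sha1()
    sha.update(b"blob %d\0" % size)
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()
```

The returned 40-character hex string is what would go into the existing checksum field.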
