Leverage gharchive.org dataset #20

Open
brillout opened this issue Oct 10, 2018 · 0 comments
Labels
enhancement New feature or request

Comments

@brillout

In the dataset, commits are included in the events called PushEvent (https://developer.github.com/v3/activity/events/types/#pushevent).

Among other fields, a PushEvent contains:

| Key | Type | Description |
| --- | --- | --- |
| commits | array | An array of commit objects describing the pushed commits. (The array includes a maximum of 20 commits. If necessary, you can use the Commits API to fetch additional commits. This limit is applied to timeline events only and isn't applied to webhook deliveries.) |
| commits[][sha] | string | The SHA of the commit. |
| commits[][message] | string | The commit message. |
| commits[][author] | object | The git author of the commit. |
| commits[][author][name] | string | The git author's name. |
| commits[][author][email] | string | The git author's email address. |
| commits[][url] | url | URL that points to the commit API resource. |
| commits[][distinct] | boolean | Whether this commit is distinct from any that have been pushed before. |

There are several limitations:

  • A PushEvent contains a maximum of 20 commits, so any commit above this limit is simply missing from the dataset. Most PushEvents don't hit that limit and contain all their commits (something like 99%). The problem is initial pushes, which can contain several thousand commits. (E.g. a private repo moving to GitHub would have a first PushEvent with a high number of commits.) Missing out on these commits is not okay. We could use the GitHub API for such initial PushEvents that have 20 commits (and potentially thousands of truncated commits). Missing out on commits of subsequent PushEvents is probably OK.
  • Commit dates are missing, but we do have the push date. So we could take the push date as a coarse approximation of the commit date (assuming that, most of the time, a git push happens within roughly the same time frame as the commits it contains). But we shouldn't make this approximation for an initial PushEvent that has 20 commits (and potentially thousands of truncated commits).
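
A minimal sketch of the two rules above, assuming each event is one parsed JSON record from the dataset with a `created_at` timestamp and a `payload.commits` array. The function names and the `is_first_push` flag are hypothetical; how to determine that a push is the first one for a repo is left open here.

```python
from datetime import datetime, timezone

COMMIT_LIMIT = 20  # maximum number of commits listed in a PushEvent


def is_possibly_truncated(event):
    """True if the commits array hit the limit, so commits may be missing."""
    return len(event["payload"]["commits"]) >= COMMIT_LIMIT


def needs_api_backfill(event, is_first_push):
    """Only initial pushes that hit the limit are worth fetching via the GitHub API."""
    return is_first_push and is_possibly_truncated(event)


def approximate_commit_dates(event, is_first_push):
    """Map each commit SHA to the push date as a coarse commit date.

    Returns None for an initial push that hit the limit, because the push
    date says little about when thousands of older commits were made.
    """
    if needs_api_backfill(event, is_first_push):
        return None
    # Assumption: timestamps look like "2018-10-10T12:00:00Z" (UTC).
    pushed_at = datetime.strptime(
        event["created_at"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return {commit["sha"]: pushed_at for commit in event["payload"]["commits"]}
```

Subsequent pushes that hit the limit still get the approximation, which matches the idea that missing a few of their commits is probably OK.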

We could still use the dataset to get a list of repos per user. I expect this list of repos to be mostly exhaustive as:

  • I expect most repos to start public. (We can easily get stats on the ratio of repos that start private versus public, by checking whether the first PushEvent hits the 20-commit limit.)
  • If you contributed to a repo while it was private, chances are decent that you keep contributing to it after it goes open source.
  • Small contributions (only a couple of commits) are very unlikely to be missing. (Small contributions most likely only happen in public repos, and it is very unlikely that a small contribution is lost because the commits array of a subsequent PushEvent was truncated.)
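
Building the list of repos per user could look roughly like the sketch below. It assumes each event record carries a `type` field and a `repo.name` field next to the payload, and it identifies users by the git author email from the commits array, since that is what the table above provides. The function name is hypothetical.

```python
from collections import defaultdict


def repos_per_user(events):
    """Return {author email: set of repo names} from an iterable of events."""
    repos = defaultdict(set)
    for event in events:
        if event.get("type") != "PushEvent":
            continue  # only PushEvents carry commits
        repo_name = event["repo"]["name"]
        for commit in event["payload"]["commits"]:
            repos[commit["author"]["email"]].add(repo_name)
    return dict(repos)
```

Mapping git author emails to GitHub accounts is a separate problem that this sketch does not address.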

We can also use the dataset for repos that have a first PushEvent with fewer than 20 commits. If the first PushEvent has fewer than 20 commits, then we can be confident that the repo started public. Missing out on a couple of commits is then probably OK: the approximate commit stats would likely be good enough to categorize users as "maintainer"/"gold contrib"/"silver contrib"/"bronze contrib" and to show a contribution timeline.
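
As a rough illustration of that categorization, the sketch below counts commits per author for a single repo and assigns a tier by share of commits. The thresholds in `TIERS` are made up for the example and are not taken from this issue; the input is assumed to be one repo's events in chronological order.

```python
from collections import Counter

COMMIT_LIMIT = 20  # maximum number of commits listed in a PushEvent

# Hypothetical thresholds (minimum share of commits, label), highest first.
TIERS = [
    (0.50, "maintainer"),
    (0.20, "gold contrib"),
    (0.05, "silver contrib"),
    (0.00, "bronze contrib"),
]


def commit_counts(repo_events):
    """Count commits per author email, or return None if the repo is unusable.

    A repo is considered unusable when its first PushEvent hit the limit,
    because it may have started private with many truncated commits.
    """
    pushes = [e for e in repo_events if e.get("type") == "PushEvent"]
    if not pushes or len(pushes[0]["payload"]["commits"]) >= COMMIT_LIMIT:
        return None
    counts = Counter()
    for push in pushes:
        for commit in push["payload"]["commits"]:
            # Skip commits already seen in an earlier push.
            if commit.get("distinct", True):
                counts[commit["author"]["email"]] += 1
    return counts


def categorize(counts):
    """Map each author email to a tier label based on their share of commits."""
    total = sum(counts.values())
    labels = {}
    for email, n in counts.items():
        share = n / total
        labels[email] = next(label for floor, label in TIERS if share >= floor)
    return labels
```

Share-based tiers are only one possible choice; absolute commit counts or activity over time would work with the same counts.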

@lourot lourot added the enhancement New feature or request label Oct 10, 2018
@lourot lourot added this to the scaling issues milestone Oct 11, 2018
@lourot lourot removed this from the scaling issues milestone Oct 28, 2018