feat: bring changes up to parity with contributions (#414)
adrielp committed May 22, 2024
1 parent 0d7c95c commit 3155ac1
Showing 17 changed files with 802 additions and 754 deletions.
161 changes: 59 additions & 102 deletions receiver/gitproviderreceiver/README.md
@@ -23,82 +23,63 @@ The current default set of metrics common across all vendors can be found in [do
These default metrics can be used as leading indicators to the DORA metrics,
helping provide insight into modern-day engineering practices.

## GitHub Metrics

The current metrics available via scraping from GitHub are:

- [x] Repository count
- [x] Repository branch time
- [x] Repository branch count
- [x] Repository contributor count
- [x] Repository pull request open time
- [x] Repository pull request time to merge
- [x] Repository pull request time to approval
- [x] Repository pull request count | stores an attribute of `pull_request_state` equal to `open` or `merged`

> Note: Some metrics may be disabled by default and have to be explicitly enabled.
> For example, the repository contributor count metric is one such metric, because
> it relies on the REST API, which is subject to lower rate limits.
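
For example, the contributor count metric can be turned on in the GitHub scraper's `metrics` block (the same setting appears in the full configuration example below):

```yaml
receivers:
  gitprovider:
    scrapers:
      github:
        metrics:
          git.repository.contributor.count:
            enabled: true
```
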
## GitLab Metrics

The current metrics available via scraping from GitLab are:

- [x] Repository count
- [x] Repository branch time
- [x] Repository branch count
- [ ] Repository contributor count
- [ ] Repository pull request time
- [x] Repository pull request merge time
- [x] Repository pull request approval time
- [x] Repository pull request deployment time

## Getting Started

The collection interval is common to all scrapers and is set to 30 seconds by default.

> Note: Generally speaking, if the vendor allows for anonymous API calls, then you
> won't have to configure any authentication, but you may only see public repositories
> and organizations. You may also run into significantly more rate limiting.

```yaml
gitprovider:
  collection_interval: <duration> # default = 30s; recommended 300s
  scrapers:
    <scraper1>:
    <scraper2>:
    ...
```
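
If the provider you are scraping does allow anonymous API calls, a hypothetical minimal setup simply omits the `auth` block; for instance, scraping only an organization's public GitLab projects might look like the sketch below (whether this works unauthenticated depends on the vendor's API policy):

```yaml
receivers:
  gitprovider:
    collection_interval: 300s
    scrapers:
      gitlab:
        gitlab_org: myfancyorg
```
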

A more complete example using the GitHub & GitLab scrapers with authentication is as follows:

```yaml
extensions:
  bearertokenauth/github:
    token: ${env:GH_PAT}
  bearertokenauth/gitlab:
    token: ${env:GL_PAT}

receivers:
  gitprovider:
    initial_delay: 1s
    collection_interval: 300s
    scrapers:
      # GitHub scraper settings
      github:
        metrics:
          git.repository.contributor.count:
            enabled: true
          git.repository.cve.count:
            enabled: true
        github_org: myfancyorg
        # Recommended optional query override; defaults to "{org,user}:<github_org>"
        search_query: "org:myfancyorg topic:o11yalltheway"
        endpoint: "https://selfmanagedenterpriseserver.com"
        auth:
          authenticator: bearertokenauth/github
      # GitLab scraper settings
      gitlab:
        metrics:
          git.repository.contributor.count:
            enabled: true
          git.repository.cve.count:
            enabled: true
        gitlab_org: myfancyorg
        search_topic: "o11yalltheway"
        auth:
          authenticator: bearertokenauth/gitlab

service:
  extensions: [bearertokenauth/github, bearertokenauth/gitlab]
  pipelines:
    metrics:
      receivers: [..., gitprovider]
@@ -109,74 +90,50 @@ service:
This receiver is developed upstream in the [liatrio-otel-collector distribution](https://github.com/liatrio/liatrio-otel-collector),
where a quick start exists with an [example config](https://github.com/liatrio/liatrio-otel-collector/blob/main/config/config.yaml).
A Grafana dashboard for this receiver is available on the marketplace and can be
found [here](https://grafana.com/grafana/dashboards/20976-engineering-effectiveness-metrics/).
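
Tying these pieces together, a minimal end-to-end configuration might look like the sketch below; the `debug` exporter is only a stand-in here, so swap in whatever exporters your pipeline actually uses:

```yaml
extensions:
  bearertokenauth/github:
    token: ${env:GH_PAT}

receivers:
  gitprovider:
    collection_interval: 300s
    scrapers:
      github:
        github_org: myfancyorg
        auth:
          authenticator: bearertokenauth/github

exporters:
  # Prints scraped metrics to the collector's own logs for quick verification.
  debug:

service:
  extensions: [bearertokenauth/github]
  pipelines:
    metrics:
      receivers: [gitprovider]
      exporters: [debug]
```
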
The available scrapers are:

| Scraper  | Description                                    |
|----------|------------------------------------------------|
| [github] | Git Metrics from [GitHub](https://github.com/) |
| [gitlab] | Git Metrics from [GitLab](https://gitlab.com)  |

## Rate Limiting

Because this receiver scrapes data from Git providers, it is subject to rate
limiting. The following sections give some sensible defaults for each
Git provider.

### GitHub

The GitHub scraper within this receiver primarily interacts with GitHub's
GraphQL API. The default rate limit for the GraphQL API is 5,000 points per
hour (unless your PAT is associated with a GitHub Enterprise Cloud
organization, in which case the limit is 10,000). The receiver costs on
average 4 points per repository, allowing it to scrape up to 1250 repositories
per hour under normal conditions. Given this average cost, a good collection
interval in seconds is:

```math
\text{collection\_interval (seconds)} = \frac{4n}{r/3600} + 300
```

```math
\begin{aligned}
\text{where:} \\
n &= \text{number of repositories} \\
r &= \text{hourly rate limit} \\
\end{aligned}
```

$r$ is most likely 5000, but there are factors that can change this; for more
information, see [GitHub's docs](https://docs.github.com/en/graphql/overview/rate-limits-and-node-limits-for-the-graphql-api#primary-rate-limit).
The $300$ is a buffer that accounts for this being a rough estimate and for the
initial query that fetches the list of repositories.
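
For example, scraping $n = 200$ repositories against the default $r = 5000$ points-per-hour limit gives:

```math
\frac{4 \times 200}{5000/3600} + 300 = \frac{800 \times 3600}{5000} + 300 = 576 + 300 = 876 \text{ seconds}
```

so a `collection_interval` of roughly 900s (15 minutes) would stay comfortably within the primary limit.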

In addition to these primary rate limits, GitHub enforces secondary rate limits
to prevent abuse and maintain API availability. The following secondary limit is
particularly relevant:

- **Concurrent Requests Limit**: The API allows no more than 100 concurrent
requests. This limit is shared across the REST and GraphQL APIs. Since the
scraper creates a goroutine per repository, having more than 100 repositories
returned by the `search_query` will result in exceeding this limit.
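
A hypothetical sketch of the per-team pattern recommended below, with one receiver instance and one token per team (the team names, topics, and environment variable names here are placeholders):

```yaml
extensions:
  bearertokenauth/team_a:
    token: ${env:TEAM_A_PAT}
  bearertokenauth/team_b:
    token: ${env:TEAM_B_PAT}

receivers:
  gitprovider/team_a:
    collection_interval: 300s
    scrapers:
      github:
        github_org: myfancyorg
        search_query: "org:myfancyorg topic:team-a"
        auth:
          authenticator: bearertokenauth/team_a
  gitprovider/team_b:
    collection_interval: 300s
    scrapers:
      github:
        github_org: myfancyorg
        search_query: "org:myfancyorg topic:team-b"
        auth:
          authenticator: bearertokenauth/team_b
```

Both receivers can then be wired into the same metrics pipeline; each instance stays within the 100-repository guidance and spends its own rate limit budget.
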
It is recommended to use the `search_query` config option to limit the number of
repositories that are scraped. We recommend one instance of the receiver per
team (note: `team` is not a valid qualifier when searching repositories; `topic`
is). As a reminder, each instance of the receiver should have its own
corresponding token for authentication, as this is what rate limits are tied to.

In summary, we recommend the following:

- One instance of the receiver per team
- Each instance of the receiver should have its own token
- Leverage the `search_query` config option to limit the repositories returned
  to 100 or fewer per instance
- `collection_interval` should be long enough to avoid rate limiting (see the
  formula above); recall that these are lagging indicators, so a longer interval
  is acceptable

**Additional Resources:**

- [GitHub GraphQL Primary Rate Limit](https://docs.github.com/en/graphql/overview/rate-limits-and-node-limits-for-the-graphql-api#primary-rate-limit)
- [GitHub GraphQL Secondary Rate Limit](https://docs.github.com/en/graphql/overview/rate-limits-and-node-limits-for-the-graphql-api#secondary-rate-limit)

## GitHub Metrics

The current metrics available via scraping from GitHub are:

- [x] Repository count
- [x] Repository contributor count
- [x] Repository branch count
- [x] Repository branch time
- [x] Repository branch commit aheadby count
- [x] Repository branch commit behindby count
- [x] Repository branch line addition count
- [x] Repository branch line deletion count
- [x] Repository pull request open time
- [x] Repository pull request time to merge
- [x] Repository pull request time to approval
- [x] Repository pull request count | stores an attribute of `pull_request.state` equal to `open` or `merged`
- [x] Repository CVE count | stores an attribute of `cve.severity` equal to `low`, `medium`, `high`, or `critical`

## GitLab Metrics

The current metrics available via scraping from GitLab are:

- [x] Repository count
- [ ] Repository contributor count
- [x] Repository branch count
- [x] Repository branch time
- [x] Repository branch commit aheadby count
- [x] Repository branch commit behindby count
- [x] Repository branch line addition count
- [x] Repository branch line deletion count
- [ ] Repository pull request open time
- [ ] Repository pull request time to merge
- [ ] Repository pull request time to approval
- [x] Repository pull request count | stores an attribute of `pull_request.state` equal to `open` or `merged`
- [ ] Repository CVE count | stores an attribute of `cve.severity` equal to `low`, `medium`, `high`, or `critical`

### Updating tests
