ci-automation: add hetzner testing #2142

Merged
merged 1 commit into from
Sep 9, 2024

Conversation

tormath1
Contributor

@tormath1 tormath1 commented Jul 22, 2024

This PR adds the "glue" between the Flatcar CI and the Hetzner implementation in Mantle.

NOTE: For this cloud provider there is no need to implement garbage collection: the provider gives us a temporary project to run our tests in, and the project is deleted right afterwards.

Testing done

CI 🟢 : http://jenkins.infra.kinvolk.io:8080/job/container/job/packages_all_arches/4600/cldsv/


Closes flatcar/Flatcar#1412

@tormath1 tormath1 added the main label Jul 22, 2024
@tormath1 tormath1 self-assigned this Jul 22, 2024
@tormath1 tormath1 changed the title [wip] ci-automation: add hetzner testing ci-automation: add hetzner testing Jul 22, 2024
ci-automation/vendor-testing/hetzner.sh (outdated, resolved)

# -- Hetzner --
: ${HETZNER_IMAGE_NAME:='flatcar_production_hetzner_image.bin.bz2'}
: ${HETZNER_amd64_INSTANCE_TYPE:="cx22"}
Contributor

cx plans have limited availability. To reduce flaky pipelines I would recommend cpx11 instead.

https://status.hetzner.com/incident/aa5ce33b-faa5-4fd0-9782-fde43cd270cf

  • Title: Limited availability of CX plans
  • Status: In progress
  • Start: 2024-06-17 10:00 UTC+0
  • Description: Due to increased demand, instances on CX plans (CX22 to CX52) are currently subject to limited availability.
    We will gradually release new capacities as soon as they are available.
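
For illustration, the `: ${VAR:=default}` lines in the snippet under review assign a value only when the variable is unset or empty, so a job can pick another instance type from its environment without editing the file. The sketch below uses `cpx11` as the amd64 default, as suggested in this comment. The helper `hetzner_instance_type` and the idea of looking the value up by architecture name are assumptions for the example, not code from this PR.

```shell
# Assign the default only when the variable is unset or empty.
: "${HETZNER_amd64_INSTANCE_TYPE:=cpx11}"

# Hypothetical helper (not part of this PR): print the instance type
# configured for the given architecture.
hetzner_instance_type() {
    arch="$1"
    # Accept only known architecture names before building a variable name.
    case "$arch" in
        amd64|arm64) ;;
        *) echo "unsupported architecture: ${arch}" >&2; return 1 ;;
    esac
    # Read HETZNER_<arch>_INSTANCE_TYPE, treating an unset variable as empty.
    eval "value=\${HETZNER_${arch}_INSTANCE_TYPE:-}"
    if [ -z "$value" ]; then
        echo "no instance type configured for ${arch}" >&2
        return 1
    fi
    printf '%s\n' "$value"
}
```

Exporting `HETZNER_amd64_INSTANCE_TYPE` before the file is sourced would take precedence over the default.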

@tormath1
Contributor Author

@apricote AMD tests look perfect but ARM tests are not happy:

Failed tests:
 cl.internet
 cl.basic
 coreos.ignition.security.tls
 cl.flannel.udp
 coreos.ignition.resource.local

With:

 bash: line 1: ./kolet: cannot execute binary file: Exec format error
     --- FAIL: cl.basic/CloudConfig (1.84s)
             cluster.go:85: kolet: Process exited with status 126
         cluster.go:82: kolet:
 bash: line 1: ./kolet: cannot execute binary file: Exec format error
     --- FAIL: cl.basic/PortSSH (1.84s)
             cluster.go:85: kolet: Process exited with status 126
 --- FAIL: cl.flannel.udp (1096.63s)
         harness.go:628: Cluster failed starting machines: machine "52347143" failed basic checks: ssh unreachable or system not ready: failure checking if machine is running: systemctl is-system-running returned stdout: "starting", stderr: "", err: Process exited with status 1, systemctl list-jobs returned stdout: "JOB UNIT                        TYPE  STATE\n736 etcd-member.service         start running\n583 multi-user.target           start waiting\n718 flannel-docker-opts.service start waiting\n717 flanneld.service            start waiting\n\n4 jobs listed.", stderr: "", err: <nil>
 --- FAIL: coreos.ignition.resource.local (1187.20s)
         resource.go:338: starting client: machine "52347217" failed to start: ssh journalctl failed: time limit exceeded: dial tcp 49.13.137.171:22: connect: connection refused
 --- FAIL: coreos.ignition.security.tls (1193.94s)
         security.go:137: starting client: machine "52347233" failed to start: ssh journalctl failed: time limit exceeded: dial tcp 188.245.121.230:22: connect: connection refused
 FAIL, output in _kola_temp/hetzner-2024-08-27-2021-34

kolet is a binary (part of Kola) that is uploaded to the machine under test, where it runs some "native" tests directly. The "Exec format error" usually means that it has been compiled for the wrong architecture.

For the other tests, I will try to repro.
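
One way to confirm an architecture mismatch like this is to read the `e_machine` field from the binary's ELF header and compare it with the architecture of the target machine. This is a minimal sketch and not part of Kola: the function name `elf_arch` is made up for the example, and it assumes a little-endian ELF file, which holds for typical x86-64 and aarch64 builds.

```shell
# Print the architecture an ELF binary was built for (amd64, arm64 or unknown).
elf_arch() {
    elf_file="$1"
    # The first four bytes of every ELF file are 0x7f 'E' 'L' 'F'.
    magic=$(od -An -tx1 -N4 "$elf_file" 2>/dev/null | tr -d ' \n')
    if [ "$magic" != "7f454c46" ]; then
        echo "not an ELF file: ${elf_file}" >&2
        return 1
    fi
    # e_machine is a 16-bit field at byte offset 18; read its two bytes
    # as unsigned decimals (little-endian layout assumed).
    set -- $(od -An -tu1 -j18 -N2 "$elf_file")
    [ "$#" -eq 2 ] || return 1
    case $(( $1 + $2 * 256 )) in
        62)  echo amd64 ;;    # EM_X86_64
        183) echo arm64 ;;    # EM_AARCH64
        *)   echo unknown ;;
    esac
}
```

Running `elf_arch ./kolet` on the uploaded binary and comparing the result with `uname -m` on the test machine would show whether the two disagree.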

github-actions bot commented Aug 28, 2024

Build action triggered: https://github.com/flatcar/scripts/actions/runs/10768868921

@apricote
Contributor

@tormath1 Can you try again with the patches from flatcar/mantle#553? It works now for me on cl.basic. Did not try the other tests.

@tormath1 tormath1 marked this pull request as ready for review September 2, 2024 13:39
@tormath1 tormath1 requested a review from a team September 2, 2024 13:39
@tormath1
Contributor Author

tormath1 commented Sep 2, 2024

> @tormath1 Can you try again with the patches from flatcar/mantle#553? It works now for me on cl.basic. Did not try the other tests.

@apricote All good now - a bit flaky on the Kubernetes test but I think it was transient:

         cluster.go:125: failed to pull image "registry.k8s.io/kube-apiserver:v1.28.7": output: E0902 12:46:59.711737    1962 remote_image.go:167] "PullImage from image service failed" err="rpc error: code = Unknown desc = failed to pull and unpack image \"registry.k8s.io/kube-apiserver:v1.28.7\": failed to resolve reference \"registry.k8s.io/kube-apiserver:v1.28.7\": unexpected status from HEAD request to https://registry.k8s.io/v2/kube-apiserver/manifests/v1.28.7: 403 Forbidden" image="registry.k8s.io/kube-apiserver:v1.28.7"
         cluster.go:125: time="2024-09-02T12:46:59Z" level=fatal msg="pulling image: rpc error: code = Unknown desc = failed to pull and unpack image \"registry.k8s.io/kube-apiserver:v1.28.7\": failed to resolve reference \"registry.k8s.io/kube-apiserver:v1.28.7\": unexpected status from HEAD request to https://registry.k8s.io/v2/kube-apiserver/manifests/v1.28.7: 403 Forbidden"

--fail-with-body \
--retry 2 \
--silent \
--user-agent "flatcar-ci/unknown" \
Member

Could you add a comment about why we override the user agent?

Contributor Author

@apricote any idea why we need this? I see it's similar to this: https://github.com/hetznercloud/tps-action/blob/dee5dd2546322c28ed8f74b910189066e8b6f31a/get-token.sh#L19 but not sure why we need it.

Member

This looks like a debugging aid?

hetznercloud/tps-action#5

Contributor

It works without the user agent. Having the user agent in requests is helpful for us operating the service as we can easily figure out who is affected by any errors, who is sending too many requests...

Of course, the user agent can easily be faked by anyone, so we don't fully rely on it

Contributor Author

Thanks @apricote for sharing the details. Let's keep the user agent like this in this case as it can help you to monitor the TPS system.
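
A code comment along the lines requested in this thread could look like the sketch below. The curl flags are the ones from the snippet under review. The function names, the `TPS_URL` variable and the `FLATCAR_CI_VERSION` variable are placeholders invented for the example; the real endpoint and request details are not shown in this conversation.

```shell
# Build the User-Agent string; the version part falls back to "unknown".
ci_user_agent() {
    printf 'flatcar-ci/%s' "${1:-unknown}"
}

# Hypothetical wrapper: TPS_URL is a placeholder, not the real endpoint.
request_tps_token() {
    # The request works without a custom user agent. It is sent so that the
    # operators of the service can see which client is affected by errors
    # or is sending too many requests.
    curl \
        --fail-with-body \
        --retry 2 \
        --silent \
        --user-agent "$(ci_user_agent "${FLATCAR_CI_VERSION:-}")" \
        "${TPS_URL:?TPS_URL must be set}"
}
```

Keeping the user agent in one small function makes it easy to add a real version string later without touching the request itself.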

No need for garbage collection since one temporary project is allocated with 1h of
lifespan for each run.

Signed-off-by: Mathieu Tortuyaux <[email protected]>
Co-authored-by: Julian Tölle <[email protected]>
Member

@krnowak krnowak left a comment

The user-agent issue is a minor thing. Regardless of whether it gets removed or not, the PR looks good.

@tormath1 tormath1 merged commit d9dcc75 into main Sep 9, 2024
1 check failed
@tormath1 tormath1 deleted the tormath1/ci-hetzner branch September 9, 2024 07:46
@tormath1 tormath1 removed the lts label Sep 9, 2024
@tormath1
Contributor Author

tormath1 commented Sep 9, 2024

cherry-picked to:

  • flatcar-4081
  • flatcar-4054
  • flatcar-3975

Successfully merging this pull request may close these issues.

[RFE] Hetzner support
3 participants