
Add ClickHouse server TLS option #348

Merged: 1 commit merged into antrea-io:main on Jul 28, 2023

Conversation

@yanjunz97 (Contributor) commented Jun 21, 2023

Generally, we recommend using Ingress/LoadBalancer for a secure connection to ClickHouse,
but we also provide this TLS option on the ClickHouse server side.

Currently the ClickHouse Operator does not support cert-manager, which means users need to
perform cert rotation manually when using this ClickHouse server-side TLS option.

This PR also includes a script to generate all the required cert files for ClickHouse.
You can use it with the command shown below. It generates the DH parameter, a server
cert, a server key, and a CA cert in build/charts/theia/provisioning/tls.
You can provision the CA cert on the client side.
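For reference, the invocation (the same one used in the testing steps later in this thread) is:

    ./hack/generate-clickhouse-cert.sh --SAN IP:127.0.0.1 --dhparam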

This PR also offers an option to use certs generated by Helm's built-in certificate generation functionality.
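For readers unfamiliar with that functionality, here is a minimal sketch of Helm-generated certs using Sprig's genCA and genSignedCert; the Secret name, namespace, and DNS name below are illustrative assumptions, not necessarily what this PR generates:

    {{- if .Values.clickhouse.service.secureConnection.enable }}
    {{- /* Generate a CA and a server cert valid for one year (hypothetical names). */}}
    {{- $ca := genCA "clickhouse-ca" 365 }}
    {{- $cert := genSignedCert "clickhouse-clickhouse.flow-visibility.svc" nil (list "clickhouse-clickhouse.flow-visibility.svc") 365 $ca }}
    apiVersion: v1
    kind: Secret
    metadata:
      name: clickhouse-tls
      namespace: flow-visibility
    type: kubernetes.io/tls
    data:
      tls.crt: {{ $cert.Cert | b64enc }}
      tls.key: {{ $cert.Key | b64enc }}
      ca.crt: {{ $ca.Cert | b64enc }}
    {{- end }}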

codecov bot commented Jun 21, 2023

Codecov Report

Merging #348 (42731d9) into main (8b8f591) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##             main     #348   +/-   ##
=======================================
  Coverage   71.43%   71.43%           
=======================================
  Files          40       40           
  Lines        5220     5220           
=======================================
  Hits         3729     3729           
  Misses       1322     1322           
  Partials      169      169           
Flag              Coverage Δ        *Carryforward flag
kind-e2e-tests    63.99% <ø> (ø)
python-coverage   56.37% <ø> (ø)    Carriedforward from 8b8f591
unit-tests        70.56% <ø> (ø)

*This pull request uses carry forward flags.

@yanjunz97 (Contributor, Author) commented Jun 21, 2023

This PR is still in draft, as we haven't adopted the secure connection support on the client side yet.
I've only tested it with a local client (without k8s).

Here is a reference for testing this ClickHouse TLS support before we add support on the client side.

  1. Use this command to generate the certs:
     ./hack/generate-clickhouse-cert.sh --SAN IP:127.0.0.1 --dhparam
  2. After setting clickhouse.service.secureConnection.enable to true and deploying ClickHouse,
     you can use the following command to expose the secure port:
     kubectl port-forward svc/clickhouse-clickhouse -n flow-visibility 9440:9440
  3. You can use the following code to connect to ClickHouse through the secure TCP port.
     Make sure ca-cert.pem is reachable and has the same contents as the one generated in
     build/charts/theia/provisioning/tls.
	// Required imports (assuming the clickhouse-go v2 client and klog):
	//   "crypto/tls"
	//   "crypto/x509"
	//   "os"
	//   "github.com/ClickHouse/clickhouse-go/v2"
	//   "k8s.io/klog/v2"

	// Load the CA cert and add it to the pool used to verify the server cert.
	caCert, err := os.ReadFile("ca-cert.pem")
	if err != nil {
		klog.Fatal(err)
	}
	caCertPool := x509.NewCertPool()
	if ok := caCertPool.AppendCertsFromPEM(caCert); !ok {
		// err is nil at this point, so report the parse failure explicitly.
		klog.Fatal("failed to parse CA cert from ca-cert.pem")
	}

	// Dial the port-forwarded secure TCP port with server-cert verification enabled.
	opt := clickhouse.Options{
		Addr: []string{"127.0.0.1:9440"},
		Auth: clickhouse.Auth{
			Database: "default",
			Username: "clickhouse_operator",
			Password: "clickhouse_operator_password",
		},
		TLS: &tls.Config{
			InsecureSkipVerify: false,
			RootCAs:            caCertPool,
		},
	}

	connect := clickhouse.OpenDB(&opt)
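To verify the handshake end to end, a minimal follow-up sketch (assuming clickhouse.OpenDB returns a standard *sql.DB, as in the clickhouse-go v2 client):

	// Ping forces a round trip over the TLS connection.
	if err := connect.Ping(); err != nil {
		klog.Fatal(err)
	}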

@yanjunz97 (Contributor, Author) commented Jul 14, 2023

This PR has been updated with reference to antrea-io/antrea#5171. It adopts certs generated by Helm's built-in certificate generation functionality and removes the cert-generation script from hack.

{{- if .Values.clickhouse.service.secureConnection.enable }}
secure: "yes"
settings:
  tcp_port: {{ .Values.clickhouse.service.tcpPort }} # keep for localhost
Contributor:

Do we want to expose tcp_port and http_port when secureConnection is disabled?
I would only enclose https_port and tcp_port_secure inside the double curly brackets.
I'm not quite familiar with these settings; are they optional?

Contributor Author:

Thanks for pointing this out! I've moved it out.

Let me explain how it works.

The port here defines the port inside the ClickHouse Pod, while the port exposed by the Service is defined in the Service template:

ports:
  - name: http
    port: {{ .Values.clickhouse.service.httpPort }}
  - name: tcp
    port: {{ .Values.clickhouse.service.tcpPort }}

Although we may prefer consistency between these two, ClickHouse itself makes some internal calls on TCP port 9000 before the ClickHouse server is ready to back the Service, i.e. 9000 is hard-coded in the ClickHouse code and I have not found an easy way to override it. My solution is therefore to keep the internal TCP port fixed at 9000 while the other ports follow the user's customization. This does not affect the user experience.
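To illustrate the mapping described above, a hypothetical Service port entry could look like the following; targetPort is my own illustration of how a customizable external port can map onto the fixed internal port, not necessarily this chart's exact template:

    ports:
      - name: tcp
        port: {{ .Values.clickhouse.service.tcpPort }}  # customizable, exposed by the Service
        targetPort: 9000                                 # fixed internal ClickHouse TCP port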

yanjunz97 force-pushed the tls-ch branch 2 times, most recently from c25f238 to e9aa192 on July 24, 2023 at 23:41
@yuntanghsu (Contributor) left a comment:

LGTM

@heanlan (Contributor) left a comment:

I don't have a strong opinion on refactoring the descriptions of the three fields commonName, ipAddresses, and dnsNames, but I suggest specifying that ipAddresses and dnsNames are optional.

@heanlan (Contributor) commented Jul 25, 2023

I noticed we didn't include a documentation update in this PR. @yuntanghsu will open a PR adding documentation for supporting a TLS connection between FA and CH. Do we plan to add it together in Yun-Tang's PR?

@yuntanghsu (Contributor) replied:

> I noticed we didn't include a documentation update in this PR. @yuntanghsu will open a PR adding documentation for supporting a TLS connection between FA and CH. Do we plan to add it together in Yun-Tang's PR?

For the multi-cluster document, since we haven't finished the multi-cluster story, Elton is thinking of adding that document later.

@heanlan (Contributor) commented Jul 25, 2023

> For the multi-cluster document, since we haven't finished the multi-cluster story, Elton is thinking of adding that document later.

I suppose we can add documentation on how to configure a TLS connection between FA and CH?

yanjunz97 force-pushed the tls-ch branch 2 times, most recently from 5e21d7c to d0debba on July 25, 2023 at 23:38
@yanjunz97 (Contributor, Author) replied:

> I suppose we can add documentation on how to configure a TLS connection between FA and CH?

That makes sense. I added some documentation for the secure connection in this PR. Please take a look when you have time. Thanks!

	}
	pods, err := data.clientset.CoreV1().Pods(kubeNamespace).List(context.TODO(), listOptions)
	if err != nil {
		return nil, fmt.Errorf("failed to list Flow Aggregator Pod: %v", err)
Contributor:

Suggested change:
- return nil, fmt.Errorf("failed to list Flow Aggregator Pod: %v", err)
+ return nil, fmt.Errorf("failed to list ClickHouse Operator Pod: %v", err)

Contributor Author:

Thanks Yuntang! I finally removed the changes in e2e tests and added a discussion below.

@@ -1278,6 +1293,34 @@ func (data *TestData) deployFlowVisibilityCommon(chOperatorYML, flowVisibilityYM
	if err != nil || rc != 0 {
		return fmt.Errorf("error when deploying the ClickHouse Operator YML: %v\n is %s available on the control-plane Node?", err, chOperatorYML)
	}
	// Check for clickhouse operator pod running again for db connection establishment
Contributor Author:

Thanks! That makes sense to me. But I found that simply checking the ClickHouse Operator is not enough, so I removed these parts. For more details, see my comments below.

@yanjunz97 (Contributor, Author) commented:

I noticed a strange behavior when running the ClickHouse migration e2e test in this PR.

When we update the ClickHouse Operator, it schedules a task to renew the ClickHouse pod,
i.e. even if we do not apply a new flow-visibility yaml to update the ClickHouse Installation,
it will still restart the ClickHouse pod according to the old manifest.

If we update the ClickHouse Operator followed by the new Flow Visibility manifest, it is usually
able to combine the two tasks and update the ClickHouse pod once to fulfill the new
Flow Visibility manifest.

However, if we have some changes in .spec.configuration.clusters.settings of the ClickHouseInstallation
(that's what I tried in this PR; changes elsewhere in .spec.configuration may behave similarly),
the ClickHouse Operator is not able to combine the two tasks. Instead, it runs the first task
to renew ClickHouse with the old manifest and then the second task to update ClickHouse
with the new manifest.

Here is the error that may happen in the e2e test. The root cause is in the downgrade tests.
Since we have already applied the lower-version Flow Visibility manifest while the ClickHouse
Operator is still applying the higher-version Flow Visibility manifest, it may find that some
migrators are missing from the ConfigMap. This makes the first ClickHouseInstallation task
get stuck for a while (usually several minutes) and time out.

          Type     Reason                        Age                    From               Message
          ----     ------                        ----                   ----               -------
          Warning  MultiplePodDisruptionBudgets  4m54s                  controllermanager  Pod "flow-visibility"/"chi-clickhouse-clickhouse-0-0-0" matches multiple PodDisruptionBudgets.  Chose "clickhouse-clickhouse" arbitrarily.
          Normal   Scheduled                     4m54s                  default-scheduler  Successfully assigned flow-visibility/chi-clickhouse-clickhouse-0-0-0 to kind-worker
          Warning  MultiplePodDisruptionBudgets  4m54s (x3 over 4m54s)  controllermanager  Pod "flow-visibility"/"chi-clickhouse-clickhouse-0-0-0" matches multiple PodDisruptionBudgets.  Chose "clickhouse" arbitrarily.
          Warning  FailedMount                   2m51s                  kubelet            Unable to attach or mount volumes: unmounted volumes=[clickhouse-configmap-volume], unattached volumes=[chi-clickhouse-deploy-confd-clickhouse-0-0 clickhouse-storage-template kube-api-access-4rf4q clickhouse-monitor-coverage clickhouse-configmap-volume chi-clickhouse-common-configd chi-clickhouse-common-usersd]: timed out waiting for the condition
          Warning  FailedMount                   48s                    kubelet            Unable to attach or mount volumes: unmounted volumes=[clickhouse-configmap-volume], unattached volumes=[clickhouse-storage-template kube-api-access-4rf4q clickhouse-monitor-coverage clickhouse-configmap-volume chi-clickhouse-common-configd chi-clickhouse-common-usersd chi-clickhouse-deploy-confd-clickhouse-0-0]: timed out waiting for the condition
          Warning  FailedMount                   44s (x10 over 4m53s)   kubelet            MountVolume.SetUp failed for volume "clickhouse-configmap-volume" : configmap references non-existent config key: 000005_0-6-0.down.sql
2023/07/26 01:25:18 Exporting test logs to '/home/runner/work/theia/theia/log/TestMigrate/beforeTeardown.Jul26-01-25-18'
2023/07/26 01:25:19 Error when running kubectl command on control-plane Node, cmd:kubectl -n flow-visibility logs --all-containers chi-clickhouse-clickhouse-0-0-0
stdout:
stderr:Error from server (BadRequest): container "clickhouse-monitor" in pod "chi-clickhouse-clickhouse-0-0-0" is waiting to start: ContainerCreating

This should not break users' usage, since after several retries it starts to apply the second task
and everything keeps going, but it does break the e2e tests. I feel that simply extending the
timeout is not a good choice (and it turns out that extending the current 6-minute wait for the
ClickHouse pod to 9 minutes is still not enough), but I also haven't come up with a better solution.
The only thought from my side is that we may consider applying the new Flow Visibility yaml
after the first task is finished, but that is a little difficult to track.

As the changes in .spec.configuration.clusters.settings are not a must-have in this PR,
I removed them for now, and I log this problem here in case we need to revisit it in
the future. I would also like to hear your opinions on this.

cc @heanlan @yuntanghsu @tushartathgur

@heanlan (Contributor) left a comment:

LGTM

yanjunz97 merged commit 32d4c83 into antrea-io:main on Jul 28, 2023
45 checks passed