understanding pvc vs snapshot clone
gqlo committed Jun 18, 2024
1 parent 117a2de commit 78bba25
Showing 9 changed files with 289 additions and 10 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -3,4 +3,4 @@ This repo includes a list of my technical blogs and notes:
* [HyperShift with the KubeVirt Provider Cluster Configuration](https://github.com/gqlo/blogs/blob/main/hypershift-kubevirt-cluster-config.md)


**Last Updated:** 2024-06-11 12:26 PM
**Last Updated:** 2024-06-18 11:10 AM
4 changes: 2 additions & 2 deletions cnv-notes.md
@@ -1,7 +1,7 @@
CNV related notes
===========================================================
## Last Updated
**Last Updated:** 2024-06-11 12:26 PM
**Last Updated:** 2024-06-18 11:10 AM

## Introduction
Some notes on CNV templates and related configuration
@@ -78,4 +78,4 @@ C:\ProgramData\Microsoft\Windows\Start Menu\Programs\StartUp



**Last Updated:** 2024-06-11 12:26 PM
**Last Updated:** 2024-06-18 11:10 AM
2 changes: 1 addition & 1 deletion hcp-resource-usage-pattern.md
@@ -2,7 +2,7 @@ Correlating QPS Rate with Resource Utilization in Self-Managed Hosted Control Pl
====================================================

## Last Updated
**Last Updated:** 2024-06-11 12:26 PM
**Last Updated:** 2024-06-18 11:10 AM

## Introduction
The general availability of [self-managed hosted control planes](https://www.redhat.com/en/blog/unlocking-new-possibilities-general-availability-hosted-control-planes-self-managed-red-hat-openshift) (HCP) with OpenShift Virtualization (KubeVirt) is an exciting milestone. Yet, the true test lies in system performance and scalability, which are both crucial factors that determine success. Understanding and pushing these limits is essential for making informed decisions. This blog offers a comprehensive analysis and general sizing insights for consolidating existing bare metal resources using self-managed HCP with the OpenShift Virtualization provider. It delves into the resource usage patterns of the HCP, examining their relationship with the KubeAPIServer QPS rate. Through various experiments, we established linear regression models between the KubeAPIServer QPS rate and CPU, memory, and etcd storage utilization, providing valuable insights for efficient resource consolidation and node capacity planning.
2 changes: 1 addition & 1 deletion hosted-control-plane-with-the-kubevirt-provider.md
@@ -1,7 +1,7 @@
Effortlessly And Efficiently Provision OpenShift Clusters With OpenShift Virtualization
====================================================
## Last Updated
**Last Updated:** 2024-06-11 12:26 PM
**Last Updated:** 2024-06-18 11:10 AM

## Introduction
[Hosted control planes](https://docs.openshift.com/container-platform/4.12/architecture/control-plane.html#hosted-control-planes-overview_control-plane) for Red Hat OpenShift with the KubeVirt provider make it possible to host OpenShift tenant clusters on bare metal machines at scale. It can be installed on an existing bare metal OpenShift cluster (OCP) environment, allowing you to quickly provision multiple guest clusters using KubeVirt virtual machines. The current model allows running hosted control planes and KubeVirt virtual machines on the same underlying base OCP cluster. Unlike a standalone OpenShift cluster, where some of the Kubernetes services in the control plane run as systemd services, the control planes that HyperShift deploys are just another workload that can be scheduled on any available nodes, placed in their own dedicated namespaces. This post will show the detailed steps of installing HyperShift with the KubeVirt provider on an existing bare metal cluster and configuring the necessary components to launch guest clusters in a matter of minutes.
2 changes: 1 addition & 1 deletion hypershift-kubevirt-cluster-config.md
@@ -1,7 +1,7 @@
HyperShift with the KubeVirt provider Cluster Configuration
===========================================================
## Last Updated
**Last Updated:** 2024-06-11 12:26 PM
**Last Updated:** 2024-06-18 11:10 AM

## Introduction
This document covers some detailed cluster configurations that you may find useful in a production environment.
67 changes: 65 additions & 2 deletions k8s-net-tracing.md
@@ -1,7 +1,7 @@
A close look at k8s networking model
===========================================================
## Last Updated
**Last Updated:** 2024-06-11 12:26 PM
**Last Updated:** 2024-06-18 11:10 AM

## Introduction
This document traces in detail how the k8s networking model works in a single-node OpenShift cluster environment.
@@ -219,4 +219,67 @@ crictl inspectp 6a0f2d307f4fa4af097705a652a3abf5ae9042d6209bf533e4d73d5d66610475
},
{
```
We are only interested in the network namespace, so that later we can enter that namespace and find the network interface of that particular container. From the output above, we found the network namespace of the container; we can enter it with:
```
nsenter --net=/host//var/run/netns/6a52c837-2ff1-4832-aa01-055d41d95b3d
```
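Rather than eyeballing the JSON, the network namespace path can also be extracted with jq; a small sketch, assuming the sandbox spec lists namespaces under `.info.runtimeSpec.linux.namespaces` (as CRI-O typically does):
```
# Pull the network namespace path out of the pod sandbox inspection output
crictl inspectp 6a0f2d307f4fa4af097705a652a3abf5ae9042d6209bf533e4d73d5d66610475 \
  | jq -r '.info.runtimeSpec.linux.namespaces[] | select(.type == "network") | .path'
```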
To confirm we are in the right place:
```
[root@52-54-00-a7-93-9c /]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0@if106: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default
link/ether 0a:58:0a:80:00:51 brd ff:ff:ff:ff:ff:ff link-netns 4a15dff6-6f1d-421f-8599-1b8b9f62ede4
inet 10.128.0.81/23 brd 10.128.1.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::858:aff:fe80:51/64 scope link
valid_lft forever preferred_lft forever
```
After confirming that we are in the right place, let's start a tcpdump session listening on TCP port 8080:
```
tcpdump -i eth0 -nn tcp port 8080
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
```
Then we can curl the service endpoint from the host:
```
curl -L -k httpd-ex-git-demo-sno.apps.sno.example.com
```
Here is the full tcpdump within the container:
```
tcpdump -i eth0 -nn tcp port 8080
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
1. 06:43:20.251037 IP 10.128.0.2.58884 > 10.128.0.81.8080: Flags [S], seq 652481145, win 65280, options [mss 1360,sackOK,TS val 899985808 ecr 0,nop,wscale 7], length 0
2. 06:43:20.251101 IP 10.128.0.81.8080 > 10.128.0.2.58884: Flags [S.], seq 440466640, ack 652481146, win 64704, options [mss 1360,sackOK,TS val 1638289445 ecr 899985808,nop,wscale 7], length 0
3. 06:43:20.251576 IP 10.128.0.2.58884 > 10.128.0.81.8080: Flags [.], ack 1, win 510, options [nop,nop,TS val 899985809 ecr 1638289445], length 0
4. 06:43:20.251676 IP 10.128.0.2.58884 > 10.128.0.81.8080: Flags [P.], seq 1:339, ack 1, win 510, options [nop,nop,TS val 899985809 ecr 1638289445], length 338: HTTP: GET / HTTP/1.1
5. 06:43:20.251697 IP 10.128.0.81.8080 > 10.128.0.2.58884: Flags [.], ack 339, win 503, options [nop,nop,TS val 1638289446 ecr 899985809], length 0
6. 06:43:20.276392 IP 10.128.0.81.8080 > 10.128.0.2.58884: Flags [P.], seq 1:295, ack 339, win 503, options [nop,nop,TS val 1638289470 ecr 899985809], length 294: HTTP: HTTP/1.1 200 OK
7. 06:43:20.276517 IP 10.128.0.2.58884 > 10.128.0.81.8080: Flags [.], ack 295, win 508, options [nop,nop,TS val 899985834 ecr 1638289470], length 0
8. 06:43:25.280762 IP 10.128.0.81.8080 > 10.128.0.2.58884: Flags [F.], seq 295, ack 339, win 503, options [nop,nop,TS val 1638294475 ecr 899985834], length 0
9. 06:43:25.280915 IP 10.128.0.2.58884 > 10.128.0.81.8080: Flags [F.], seq 339, ack 296, win 508, options [nop,nop,TS val 899990839 ecr 1638294475], length 0
10. 06:43:25.280946 IP 10.128.0.81.8080 > 10.128.0.2.58884: Flags [.], ack 340, win 503, options [nop,nop,TS val 1638294475 ecr 899990839], length 0
```
This tcpdump output is a textbook example of the TCP [3-way handshake](https://www.geeksforgeeks.org/tcp-3-way-handshake-process/) and of how an HTTP request is handled. See this [list of tcpdump flags](https://amits-notes.readthedocs.io/en/latest/networking/tcpdump.html) and their meanings (a flag-based capture filter is sketched after this list).
1. Client sends SYN to establish communication.
2. Server responds with SYN + ACK.
3. Client sends ACK; the reliable connection is established.
4. Client sends the HTTP GET request.
5. Server ACKs the HTTP GET request.
6. Server sends the HTTP 200 OK response.
7. Client ACKs the HTTP response.
8. Server sends FIN, indicating it wants to terminate the connection.
9. Client responds with its own FIN to terminate the connection.
10. Server sends the final ACK to complete the termination.
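As a side note, tcpdump can also filter on those TCP flags directly; a small sketch using the same interface and port as above, capturing only the connection setup and teardown packets:
```
# Only show segments with SYN or FIN set (handshake and termination)
tcpdump -i eth0 -nn 'tcp port 8080 and (tcp[tcpflags] & (tcp-syn|tcp-fin) != 0)'
```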

But we are curling the service endpoint, which has the IP 192.168.122.237, so apparently there is something in between. What, then, is this IP 10.128.0.2? Let's find out.
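A couple of quick lookups can narrow it down; a sketch, assuming host access on the node and cluster-admin for the second command:
```
# Is 10.128.0.2 assigned to a host/CNI interface on this node?
ip -o -4 addr show | grep -F '10.128.0.2/'
# Or does it belong to a pod?
oc get pods -A -o wide | grep -F '10.128.0.2'
```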



2 changes: 1 addition & 1 deletion kernel-proc-diskstats.md
@@ -1,7 +1,7 @@
Understanding Kernel /proc/diskstats
===========================================================
## Last Updated
**Last Updated:** 2024-06-11 12:26 PM
**Last Updated:** 2024-06-18 11:10 AM

## Introduction
The proc virtual filesystem contains a hierarchy of special files that represent the current state of the kernel, running processes, and hardware details. Disk I/O statistics exposed by Prometheus come from the kernel's raw stats in [/proc/diskstats](https://www.kernel.org/doc/Documentation/admin-guide/iostats.rst). I did a few experiments using the dd utility to understand how the kernel counts write/read requests to the actual device.
2 changes: 1 addition & 1 deletion networking-general.md
@@ -1,2 +1,2 @@

**Last Updated:** 2024-06-11 12:26 PM
**Last Updated:** 2024-06-18 11:10 AM
216 changes: 216 additions & 0 deletions pvc-vs-snapshot-clone.md
@@ -0,0 +1,216 @@
Understanding the difference between PVC and snapshot clones in Ceph
===========================================================
## Last Updated
**Last Updated:** 2024-06-18 11:10 AM

## Introduction
Cloning a volume from a PVC can be quite different from cloning it from a snapshot. Let's take a closer look at the Ceph backend to see some of the interesting differences.
```
source:
pvc:
name: rhel9-parent
```

```
source:
snapshot:
name: rhel9-snapshot
```

## PVC Clone
I have created a DataVolume object which imports the qcow2 image of a VM. We can later use it as the parent image for all other VM clones.

```
oc get dv | grep parent
rhel9-parent Succeeded 100.0% 1 6d5h
```

```
oc get pvc | grep parent
rhel9-parent Bound pvc-dddbe126-bf25-4441-a418-effa3a2bf794 21Gi RWX ocs-storagecluster-ceph-rbd-virtualization 6d5h
```

When describing the dv object, we notice that the source is an HTTP URL where the qcow2 image is served by a Python HTTP server.
```
source:
http:
url: http://10.16.29.214:8080/rhel9.4.qcow2
```
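Putting these pieces together, the parent DataVolume manifest looks roughly like the following sketch, reconstructed from the describe output and the PVC listing above (metadata and field layout are assumptions):
```
oc create -f - <<'EOF'
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: rhel9-parent
  namespace: default
spec:
  source:
    http:
      url: http://10.16.29.214:8080/rhel9.4.qcow2
  pvc:
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: 21Gi
    storageClassName: ocs-storagecluster-ceph-rbd-virtualization
    volumeMode: Block
EOF
```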

Then we can create a VM by cloning from this PVC. For this VM, I attached two disks: a data disk and a root disk:

```
oc get dv | grep "\-1"
data-1 Succeeded 100.0% 83m
root-1 Succeeded 100.0% 83m
```
Describing root-1, we can see that this root disk is cloned directly from the rhel9-parent PVC.
```
Resources:
Requests:
Storage: 21Gi
Storage Class Name: ocs-storagecluster-ceph-rbd-virtualization
Volume Mode: Block
Source:
Pvc:
Name: rhel9-parent
Namespace: default
```
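In manifest form, this clone corresponds to roughly the following sketch (access mode taken from the PVC listing earlier, other metadata assumed); in practice it is usually generated from a `dataVolumeTemplates` entry in the VirtualMachine spec:
```
oc create -f - <<'EOF'
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: root-1
  namespace: default
spec:
  source:
    pvc:
      name: rhel9-parent
      namespace: default
  pvc:
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: 21Gi
    storageClassName: ocs-storagecluster-ceph-rbd-virtualization
    volumeMode: Block
EOF
```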
Describing data-1, we get:
```
Spec:
Pvc:
Access Modes:
ReadWriteMany
Resources:
Requests:
Storage: 50Gi
Storage Class Name: ocs-storagecluster-ceph-rbd-virtualization
Volume Mode: Block
Source:
Blank:
```
Six snapshots were created for the OS images, plus one RBD volume for the db-noobaa database:
```
sh-5.1$ rbd ls ocs-storagecluster-cephblockpool
csi-snap-056e20e6-2bd8-4f85-aaed-e60febdfe6ed
csi-snap-0b042194-f980-42f6-80e3-235dd837509c
csi-snap-28a25c3b-70fa-4624-bac2-8e9e478f44a1
csi-snap-2cc688eb-111b-4b72-b65c-05a7b480cb2c
csi-snap-b07c0baa-0cf8-4f43-aa57-8501e5436f79
csi-snap-cdee74a1-d0d7-4949-b65e-d2bfd7c911e1
csi-vol-3e051c30-9ece-4f15-bf70-ac4a05abc201
```

Let's now figure out the image IDs of those volumes so we can poke around on the Ceph backend and see how they actually reside in the Ceph cluster. The image ID is stored in the PersistentVolume object; here are the volumes and their corresponding image IDs:
```
rhel9-parent: csi-vol-6ea3447e-0e75-4dd3-9cba-5fee81a7b0b2
root-1: csi-vol-73f44909-a673-4300-bf02-97c044b51fa0
```
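One way to read the image name off a volume (the PV name comes from the rhel9-parent PVC listing earlier; the attribute name assumes the ceph-csi RBD driver):
```
# The RBD image name is stored in the PV's CSI volume attributes
oc get pv pvc-dddbe126-bf25-4441-a418-effa3a2bf794 \
  -o jsonpath='{.spec.csi.volumeAttributes.imageName}{"\n"}'
```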
The interesting thing is that whenever we clone from a PVC, a temp image is created:
```
rbd ls ocs-storagecluster-cephblockpool | grep csi-vol-73f44909-a673-4300-bf02-97c044b51fa0
csi-vol-73f44909-a673-4300-bf02-97c044b51fa0
csi-vol-73f44909-a673-4300-bf02-97c044b51fa0-temp
```
Let's examine the images one by one:
```
rbd info ocs-storagecluster-cephblockpool/csi-vol-6ea3447e-0e75-4dd3-9cba-5fee81a7b0b2
rbd image 'csi-vol-6ea3447e-0e75-4dd3-9cba-5fee81a7b0b2':
size 21 GiB in 5376 objects
order 22 (4 MiB objects)
snapshot_count: 1
id: 41f040cd3e15
block_name_prefix: rbd_data.41f040cd3e15
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, operations
op_features: clone-parent, snap-trash
flags:
create_timestamp: Sat Jun 8 01:43:53 2024
access_timestamp: Sat Jun 8 01:43:53 2024
modify_timestamp: Sat Jun 8 01:43:53 2024
```
Here we see that root-1's image has a parent, which is the temp image:
```
rbd info ocs-storagecluster-cephblockpool/csi-vol-73f44909-a673-4300-bf02-97c044b51fa0
rbd image 'csi-vol-73f44909-a673-4300-bf02-97c044b51fa0':
size 21 GiB in 5376 objects
order 22 (4 MiB objects)
snapshot_count: 0
id: 14f13bda474c8b
block_name_prefix: rbd_data.14f13bda474c8b
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, operations
op_features: clone-child
flags:
create_timestamp: Mon Jun 17 07:35:04 2024
access_timestamp: Mon Jun 17 07:35:04 2024
modify_timestamp: Mon Jun 17 07:35:04 2024
parent: ocs-storagecluster-cephblockpool/csi-vol-73f44909-a673-4300-bf02-97c044b51fa0-temp@b142fe70-b793-4671-bfed-cd8bee998c87
overlap: 21 GiB
```

If we look at the temp image, it also has a parent image, which is the base rhel9-parent image.
```
rbd info ocs-storagecluster-cephblockpool/csi-vol-73f44909-a673-4300-bf02-97c044b51fa0-temp
rbd image 'csi-vol-73f44909-a673-4300-bf02-97c044b51fa0-temp':
size 21 GiB in 5376 objects
order 22 (4 MiB objects)
snapshot_count: 1
id: 14f13b4aa27389
block_name_prefix: rbd_data.14f13b4aa27389
format: 2
features: layering, deep-flatten, operations
op_features: clone-parent, clone-child, snap-trash
flags:
create_timestamp: Mon Jun 17 07:35:03 2024
access_timestamp: Mon Jun 17 07:35:03 2024
modify_timestamp: Mon Jun 17 07:35:03 2024
parent: ocs-storagecluster-cephblockpool/csi-vol-6ea3447e-0e75-4dd3-9cba-5fee81a7b0b2@cca6b09b-4808-468c-9df0-cd2f0ad2a64d
overlap: 21 GiB
```
What Ceph did here was create a temp volume clone of the rhel9-parent image and then use that as the parent of the actual VM volume. The relationship can be described as: grandparent (rhel9-parent) -> parent (xx-temp) -> child (xx).
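The chain can also be confirmed from the top: `rbd children` lists the clones hanging off a given snapshot (the snapshot name below is taken from the temp image's `parent:` line above):
```
# Clones created from the grandparent's snapshot (should include the -temp image)
rbd children ocs-storagecluster-cephblockpool/csi-vol-6ea3447e-0e75-4dd3-9cba-5fee81a7b0b2@cca6b09b-4808-468c-9df0-cd2f0ad2a64d
```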

The actual data size of each image is identical, which makes me think both are full clones of the grandparent image:
```
sh-5.1$ rbd diff ocs-storagecluster-cephblockpool/csi-vol-73f44909-a673-4300-bf02-97c044b51fa0 | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
4377.25 MB
sh-5.1$ rbd diff ocs-storagecluster-cephblockpool/csi-vol-6ea3447e-0e75-4dd3-9cba-5fee81a7b0b2 | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
4377.25 MB
sh-5.1$ rbd diff ocs-storagecluster-cephblockpool/csi-vol-73f44909-a673-4300-bf02-97c044b51fa0-temp | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
4377.25 MB
```
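For another view of how much data each image actually holds, `rbd du` reports provisioned versus used space per image (a quick sketch; note the temp image lacks the fast-diff feature, so the calculation can be slower for it):
```
rbd du ocs-storagecluster-cephblockpool/csi-vol-6ea3447e-0e75-4dd3-9cba-5fee81a7b0b2
rbd du ocs-storagecluster-cephblockpool/csi-vol-73f44909-a673-4300-bf02-97c044b51fa0
rbd du ocs-storagecluster-cephblockpool/csi-vol-73f44909-a673-4300-bf02-97c044b51fa0-temp
```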
In summary, when we do a PVC clone, a temp volume clone is created. This seems to work around the limit on snapshots, since subsequent snapshots will be created off this unique temp volume.
## Snapshot Clone
A different method was later introduced to clone from a snapshot like this:
```
source:
snapshot:
namespace: default
name: rhel9-snap
```
So we first need to create a snapshot of the rhel9-parent image. On the Ceph backend, a snapshot image is created with a name like this:
```
csi-snap-a16525c8-8556-4a88-bca4-f6d542bc6cbf
```
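The VolumeSnapshot that backs such a csi-snap image can be created along these lines (a sketch; the VolumeSnapshotClass name is an assumption based on the default ODF RBD class):
```
oc create -f - <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: rhel9-snap
  namespace: default
spec:
  volumeSnapshotClassName: ocs-storagecluster-rbdplugin-snapclass  # assumed class name
  source:
    persistentVolumeClaimName: rhel9-parent
EOF
```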
Looking at the metadata of this snapshot image, we see that its parent is indeed the rhel9-parent image.
```
rbd info ocs-storagecluster-cephblockpool/csi-snap-a16525c8-8556-4a88-bca4-f6d542bc6cbf
rbd image 'csi-snap-a16525c8-8556-4a88-bca4-f6d542bc6cbf':
size 21 GiB in 5376 objects
order 22 (4 MiB objects)
snapshot_count: 1
id: cee6655654cd1
block_name_prefix: rbd_data.cee6655654cd1
format: 2
features: layering, deep-flatten, operations
op_features: clone-parent, clone-child
flags:
create_timestamp: Mon Jun 17 12:20:34 2024
access_timestamp: Mon Jun 17 12:20:34 2024
modify_timestamp: Mon Jun 17 12:20:34 2024
parent: ocs-storagecluster-cephblockpool/csi-vol-6ea3447e-0e75-4dd3-9cba-5fee81a7b0b2@231e26c7-eff9-4f56-879c-bb3503e03f8e
overlap: 21 GiB
```
Also, there is no temp volume anymore, just a child volume of this snapshot:
```
rbd info ocs-storagecluster-cephblockpool/csi-vol-b589183d-78c6-4379-aa7d-f416adfffe66
rbd image 'csi-vol-b589183d-78c6-4379-aa7d-f416adfffe66':
size 21 GiB in 5376 objects
order 22 (4 MiB objects)
snapshot_count: 0
id: 14f13b979213bd
block_name_prefix: rbd_data.14f13b979213bd
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, operations
op_features: clone-child
flags:
create_timestamp: Mon Jun 17 12:28:21 2024
access_timestamp: Mon Jun 17 12:28:21 2024
modify_timestamp: Mon Jun 17 12:28:21 2024
parent: ocs-storagecluster-cephblockpool/csi-snap-a16525c8-8556-4a88-bca4-f6d542bc6cbf@csi-snap-a16525c8-8556-4a88-bca4-f6d542bc6cbf
overlap: 21 GiB
```
Every subsequent volume clone will point back to the same snapshot image.
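This is easy to verify from the Ceph side: listing the children of that snapshot should show every cloned volume (names taken from the `rbd info` output above):
```
rbd children ocs-storagecluster-cephblockpool/csi-snap-a16525c8-8556-4a88-bca4-f6d542bc6cbf@csi-snap-a16525c8-8556-4a88-bca4-f6d542bc6cbf
```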
