understanding pvc vs snapshot clone
gqlo committed Jun 18, 2024
1 parent 117a2de commit 78bba25
Showing 9 changed files with 289 additions and 10 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -3,4 +3,4 @@ This repo includes a list of my technical blogs and notes:
* [HyperShift with the KubeVirt Provider Cluster Configuration](https://github.com/gqlo/blogs/blob/main/hypershift-kubevirt-cluster-config.md)


**Last Updated:** 2024-06-11 12:26 PM
**Last Updated:** 2024-06-18 11:10 AM
4 changes: 2 additions & 2 deletions cnv-notes.md
@@ -1,7 +1,7 @@
CNV related notes
===========================================================
## Last Updated
**Last Updated:** 2024-06-11 12:26 PM
**Last Updated:** 2024-06-18 11:10 AM

## Introduction
Some notes on CNV templates and related configuration
@@ -78,4 +78,4 @@ C:\ProgramData\Microsoft\Windows\Start Menu\Programs\StartUp



**Last Updated:** 2024-06-11 12:26 PM
**Last Updated:** 2024-06-18 11:10 AM
2 changes: 1 addition & 1 deletion hcp-resource-usage-pattern.md
@@ -2,7 +2,7 @@ Correlating QPS Rate with Resource Utilization in Self-Managed Hosted Control Pl
====================================================

## Last Updated
**Last Updated:** 2024-06-11 12:26 PM
**Last Updated:** 2024-06-18 11:10 AM

## Introduction
The general availability of [self-managed hosted control planes](https://www.redhat.com/en/blog/unlocking-new-possibilities-general-availability-hosted-control-planes-self-managed-red-hat-openshift) (HCP) with OpenShift Virtualization (KubeVirt) is an exciting milestone. Yet, the true test lies in system performance and scalability, which are both crucial factors that determine success. Understanding and pushing these limits is essential for making informed decisions. This blog offers a comprehensive analysis and general sizing insights for consolidating existing bare metal resources using self-managed HCP with the OpenShift Virtualization provider. It delves into the resource usage patterns of the HCP, examining their relationship with the KubeAPIServer QPS rate. Through various experiments, we established linear regression models between the KubeAPIServer QPS rate and CPU, memory, and etcd storage utilization, providing valuable insights for efficient resource consolidation and node capacity planning.
2 changes: 1 addition & 1 deletion hosted-control-plane-with-the-kubevirt-provider.md
@@ -1,7 +1,7 @@
Effortlessly And Efficiently Provision OpenShift Clusters With OpenShift Virtualization
====================================================
## Last Updated
**Last Updated:** 2024-06-11 12:26 PM
**Last Updated:** 2024-06-18 11:10 AM

## Introduction
[Hosted control planes](https://docs.openshift.com/container-platform/4.12/architecture/control-plane.html#hosted-control-planes-overview_control-plane) for Red Hat OpenShift with the KubeVirt provider make it possible to host OpenShift tenant clusters on bare metal machines at scale. It can be installed on an existing bare metal OpenShift cluster (OCP) environment, allowing you to quickly provision multiple guest clusters using KubeVirt virtual machines. The current model allows running hosted control planes and KubeVirt virtual machines on the same underlying base OCP cluster. Unlike a standalone OpenShift cluster, where some of the Kubernetes services in the control plane run as systemd services, the control planes that HyperShift deploys are just another workload that can be scheduled on any available nodes, placed in their own dedicated namespaces. This post will show the detailed steps of installing HyperShift with the KubeVirt provider on an existing bare metal cluster and configuring the necessary components to launch guest clusters in a matter of minutes.
2 changes: 1 addition & 1 deletion hypershift-kubevirt-cluster-config.md
@@ -1,7 +1,7 @@
HyperShift with the KubeVirt provider Cluster Configuration
===========================================================
## Last Updated
**Last Updated:** 2024-06-11 12:26 PM
**Last Updated:** 2024-06-18 11:10 AM

## Introduction
This document covers some detailed cluster configurations that you may find useful in a production environment.
67 changes: 65 additions & 2 deletions k8s-net-tracing.md
@@ -1,7 +1,7 @@
A close look at k8s networking model
===========================================================
## Last Updated
**Last Updated:** 2024-06-11 12:26 PM
**Last Updated:** 2024-06-18 11:10 AM

## Introduction
This document traces in detail how the k8s networking model works in a single-node OpenShift cluster environment.
@@ -219,4 +219,67 @@ crictl inspectp 6a0f2d307f4fa4af097705a652a3abf5ae9042d6209bf533e4d73d5d66610475
},
{
```
We are only interested in the network namespace, so that later we can enter that namespace and find the network interface of that particular container. From the output above, we found the network namespace of the container; we can enter it with:
```
nsenter --net=/host//var/run/netns/6a52c837-2ff1-4832-aa01-055d41d95b3d
```
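Rather than eyeballing the JSON, the network namespace path can also be extracted with jq; a small sketch, assuming the sandbox spec lists namespaces under `.info.runtimeSpec.linux.namespaces` (as CRI-O typically does):
```
# Pull the network namespace path out of the pod sandbox inspection output
crictl inspectp 6a0f2d307f4fa4af097705a652a3abf5ae9042d6209bf533e4d73d5d66610475 \
  | jq -r '.info.runtimeSpec.linux.namespaces[] | select(.type == "network") | .path'
```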
To confirm we are in the right place:
```
[root@52-54-00-a7-93-9c /]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0@if106: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default
link/ether 0a:58:0a:80:00:51 brd ff:ff:ff:ff:ff:ff link-netns 4a15dff6-6f1d-421f-8599-1b8b9f62ede4
inet 10.128.0.81/23 brd 10.128.1.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::858:aff:fe80:51/64 scope link
valid_lft forever preferred_lft forever
```
After confirming that we are in the right place, let's start a tcpdump session listening on TCP port 8080:
```
tcpdump -i eth0 -nn tcp port 8080
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
```
Then we can curl the service endpoint from the host:
```
curl -L -k httpd-ex-git-demo-sno.apps.sno.example.com
```
Here is the full tcpdump within the container:
```
tcpdump -i eth0 -nn tcp port 8080
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
1. 06:43:20.251037 IP 10.128.0.2.58884 > 10.128.0.81.8080: Flags [S], seq 652481145, win 65280, options [mss 1360,sackOK,TS val 899985808 ecr 0,nop,wscale 7], length 0
2. 06:43:20.251101 IP 10.128.0.81.8080 > 10.128.0.2.58884: Flags [S.], seq 440466640, ack 652481146, win 64704, options [mss 1360,sackOK,TS val 1638289445 ecr 899985808,nop,wscale 7], length 0
3. 06:43:20.251576 IP 10.128.0.2.58884 > 10.128.0.81.8080: Flags [.], ack 1, win 510, options [nop,nop,TS val 899985809 ecr 1638289445], length 0
4. 06:43:20.251676 IP 10.128.0.2.58884 > 10.128.0.81.8080: Flags [P.], seq 1:339, ack 1, win 510, options [nop,nop,TS val 899985809 ecr 1638289445], length 338: HTTP: GET / HTTP/1.1
5. 06:43:20.251697 IP 10.128.0.81.8080 > 10.128.0.2.58884: Flags [.], ack 339, win 503, options [nop,nop,TS val 1638289446 ecr 899985809], length 0
6. 06:43:20.276392 IP 10.128.0.81.8080 > 10.128.0.2.58884: Flags [P.], seq 1:295, ack 339, win 503, options [nop,nop,TS val 1638289470 ecr 899985809], length 294: HTTP: HTTP/1.1 200 OK
7. 06:43:20.276517 IP 10.128.0.2.58884 > 10.128.0.81.8080: Flags [.], ack 295, win 508, options [nop,nop,TS val 899985834 ecr 1638289470], length 0
8. 06:43:25.280762 IP 10.128.0.81.8080 > 10.128.0.2.58884: Flags [F.], seq 295, ack 339, win 503, options [nop,nop,TS val 1638294475 ecr 899985834], length 0
9. 06:43:25.280915 IP 10.128.0.2.58884 > 10.128.0.81.8080: Flags [F.], seq 339, ack 296, win 508, options [nop,nop,TS val 899990839 ecr 1638294475], length 0
10. 06:43:25.280946 IP 10.128.0.81.8080 > 10.128.0.2.58884: Flags [.], ack 340, win 503, options [nop,nop,TS val 1638294475 ecr 899990839], length 0
```
This tcpdump output is a textbook example of the TCP [3-way handshake](https://www.geeksforgeeks.org/tcp-3-way-handshake-process/) and of how an HTTP request is handled. See this [list of tcpdump flags](https://amits-notes.readthedocs.io/en/latest/networking/tcpdump.html) and their meanings (a flag-based capture filter is sketched after this list).
1. Client sends SYN to establish communication.
2. Server responds with SYN + ACK.
3. Client sends ACK; the reliable connection is established.
4. Client sends the HTTP GET request.
5. Server ACKs the HTTP GET request.
6. Server sends the HTTP 200 OK response.
7. Client ACKs the HTTP response.
8. Server sends FIN, indicating it wants to terminate the connection.
9. Client responds with its own FIN to terminate the connection.
10. Server sends the final ACK to complete the termination.
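As a side note, tcpdump can also filter on those TCP flags directly; a small sketch using the same interface and port as above, capturing only the connection setup and teardown packets:
```
# Only show segments with SYN or FIN set (handshake and termination)
tcpdump -i eth0 -nn 'tcp port 8080 and (tcp[tcpflags] & (tcp-syn|tcp-fin) != 0)'
```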

But we are curling the service endpoint, which has the IP 192.168.122.237, so apparently there is something in between. What, then, is this IP 10.128.0.2? Let's find out.
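A couple of quick lookups can narrow it down; a sketch, assuming host access on the node and cluster-admin for the second command:
```
# Is 10.128.0.2 assigned to a host/CNI interface on this node?
ip -o -4 addr show | grep -F '10.128.0.2/'
# Or does it belong to a pod?
oc get pods -A -o wide | grep -F '10.128.0.2'
```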



2 changes: 1 addition & 1 deletion kernel-proc-diskstats.md
@@ -1,7 +1,7 @@
Understanding Kernel /proc/diskstats
===========================================================
## Last Updated
**Last Updated:** 2024-06-11 12:26 PM
**Last Updated:** 2024-06-18 11:10 AM

## Introduction
The proc virtual filesystem contains a hierarchy of special files that represent the current state of the kernel, running processes, and hardware details. Disk I/O statistics exposed by Prometheus come from the kernel's raw stats in [/proc/diskstats](https://www.kernel.org/doc/Documentation/admin-guide/iostats.rst). I did a few experiments using the dd utility to understand how the kernel counts write/read requests to the actual device.
2 changes: 1 addition & 1 deletion networking-general.md
@@ -1,2 +1,2 @@

**Last Updated:** 2024-06-11 12:26 PM
**Last Updated:** 2024-06-18 11:10 AM
216 changes: 216 additions & 0 deletions pvc-vs-snapshot-clone.md
@@ -0,0 +1,216 @@
Understanding the difference between PVC and snapshot clones in Ceph
===========================================================
## Last Updated
**Last Updated:** 2024-06-18 11:10 AM

## Introduction
Cloning a volume from a PVC can be quite different from cloning it from a snapshot. Let's take a closer look at the Ceph backend to see some of the interesting differences.
```
source:
pvc:
name: rhel9-parent
```

```
source:
snapshot:
name: rhel9-snapshot
```

## PVC Clone
I have created a DataVolume object which imports the qcow2 image of a VM. We can later use it as the parent image for all other VM clones.

```
oc get dv | grep parent
rhel9-parent Succeeded 100.0% 1 6d5h
```

```
oc get pvc | grep parent
rhel9-parent Bound pvc-dddbe126-bf25-4441-a418-effa3a2bf794 21Gi RWX ocs-storagecluster-ceph-rbd-virtualization 6d5h
```

When describing the dv object, we notice that the source is an HTTP URL where the qcow2 image is served by a Python HTTP server.
```
source:
http:
url: http://10.16.29.214:8080/rhel9.4.qcow2
```
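Putting these pieces together, the parent DataVolume manifest looks roughly like the following sketch, reconstructed from the describe output and the PVC listing above (metadata and field layout are assumptions):
```
oc create -f - <<'EOF'
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: rhel9-parent
  namespace: default
spec:
  source:
    http:
      url: http://10.16.29.214:8080/rhel9.4.qcow2
  pvc:
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: 21Gi
    storageClassName: ocs-storagecluster-ceph-rbd-virtualization
    volumeMode: Block
EOF
```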

Then we can create a VM by cloning from this PVC. For this VM, I attached two disks: a data disk and a root disk:

```
oc get dv | grep "\-1"
data-1 Succeeded 100.0% 83m
root-1 Succeeded 100.0% 83m
```
Describing root-1, we can see that this root disk is cloned directly from the rhel9-parent PVC.
```
Resources:
Requests:
Storage: 21Gi
Storage Class Name: ocs-storagecluster-ceph-rbd-virtualization
Volume Mode: Block
Source:
Pvc:
Name: rhel9-parent
Namespace: default
```
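In manifest form, this clone corresponds to roughly the following sketch (access mode taken from the PVC listing earlier, other metadata assumed); in practice it is usually generated from a `dataVolumeTemplates` entry in the VirtualMachine spec:
```
oc create -f - <<'EOF'
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: root-1
  namespace: default
spec:
  source:
    pvc:
      name: rhel9-parent
      namespace: default
  pvc:
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: 21Gi
    storageClassName: ocs-storagecluster-ceph-rbd-virtualization
    volumeMode: Block
EOF
```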
Describing data-1, we get:
```
Spec:
Pvc:
Access Modes:
ReadWriteMany
Resources:
Requests:
Storage: 50Gi
Storage Class Name: ocs-storagecluster-ceph-rbd-virtualization
Volume Mode: Block
Source:
Blank:
```
Six snapshots were created for the OS images, plus one RBD volume for the db-noobaa database:
```
sh-5.1$ rbd ls ocs-storagecluster-cephblockpool
csi-snap-056e20e6-2bd8-4f85-aaed-e60febdfe6ed
csi-snap-0b042194-f980-42f6-80e3-235dd837509c
csi-snap-28a25c3b-70fa-4624-bac2-8e9e478f44a1
csi-snap-2cc688eb-111b-4b72-b65c-05a7b480cb2c
csi-snap-b07c0baa-0cf8-4f43-aa57-8501e5436f79
csi-snap-cdee74a1-d0d7-4949-b65e-d2bfd7c911e1
csi-vol-3e051c30-9ece-4f15-bf70-ac4a05abc201
```

Let's now figure out the image IDs of those volumes so we can poke around on the Ceph backend and see how they actually reside in the Ceph cluster. The image ID is stored in the PersistentVolume object; here are the volumes and their corresponding image IDs:
```
rhel9-parent: csi-vol-6ea3447e-0e75-4dd3-9cba-5fee81a7b0b2
root-1: csi-vol-73f44909-a673-4300-bf02-97c044b51fa0
```
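One way to read the image name off a volume (the PV name comes from the rhel9-parent PVC listing earlier; the attribute name assumes the ceph-csi RBD driver):
```
# The RBD image name is stored in the PV's CSI volume attributes
oc get pv pvc-dddbe126-bf25-4441-a418-effa3a2bf794 \
  -o jsonpath='{.spec.csi.volumeAttributes.imageName}{"\n"}'
```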
The interesting thing is that whenever we clone from a PVC, a temp image is created:
```
rbd ls ocs-storagecluster-cephblockpool | grep csi-vol-73f44909-a673-4300-bf02-97c044b51fa0
csi-vol-73f44909-a673-4300-bf02-97c044b51fa0
csi-vol-73f44909-a673-4300-bf02-97c044b51fa0-temp
```
Let's examine the images one by one:
```
rbd info ocs-storagecluster-cephblockpool/csi-vol-6ea3447e-0e75-4dd3-9cba-5fee81a7b0b2
rbd image 'csi-vol-6ea3447e-0e75-4dd3-9cba-5fee81a7b0b2':
size 21 GiB in 5376 objects
order 22 (4 MiB objects)
snapshot_count: 1
id: 41f040cd3e15
block_name_prefix: rbd_data.41f040cd3e15
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, operations
op_features: clone-parent, snap-trash
flags:
create_timestamp: Sat Jun 8 01:43:53 2024
access_timestamp: Sat Jun 8 01:43:53 2024
modify_timestamp: Sat Jun 8 01:43:53 2024
```
Here we see that root-1's image has a parent, which is the temp image:
```
rbd info ocs-storagecluster-cephblockpool/csi-vol-73f44909-a673-4300-bf02-97c044b51fa0
rbd image 'csi-vol-73f44909-a673-4300-bf02-97c044b51fa0':
size 21 GiB in 5376 objects
order 22 (4 MiB objects)
snapshot_count: 0
id: 14f13bda474c8b
block_name_prefix: rbd_data.14f13bda474c8b
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, operations
op_features: clone-child
flags:
create_timestamp: Mon Jun 17 07:35:04 2024
access_timestamp: Mon Jun 17 07:35:04 2024
modify_timestamp: Mon Jun 17 07:35:04 2024
parent: ocs-storagecluster-cephblockpool/csi-vol-73f44909-a673-4300-bf02-97c044b51fa0-temp@b142fe70-b793-4671-bfed-cd8bee998c87
overlap: 21 GiB
```

If we look at the temp image, it also has a parent image, which is the base rhel9-parent image.
```
rbd info ocs-storagecluster-cephblockpool/csi-vol-73f44909-a673-4300-bf02-97c044b51fa0-temp
rbd image 'csi-vol-73f44909-a673-4300-bf02-97c044b51fa0-temp':
size 21 GiB in 5376 objects
order 22 (4 MiB objects)
snapshot_count: 1
id: 14f13b4aa27389
block_name_prefix: rbd_data.14f13b4aa27389
format: 2
features: layering, deep-flatten, operations
op_features: clone-parent, clone-child, snap-trash
flags:
create_timestamp: Mon Jun 17 07:35:03 2024
access_timestamp: Mon Jun 17 07:35:03 2024
modify_timestamp: Mon Jun 17 07:35:03 2024
parent: ocs-storagecluster-cephblockpool/csi-vol-6ea3447e-0e75-4dd3-9cba-5fee81a7b0b2@cca6b09b-4808-468c-9df0-cd2f0ad2a64d
overlap: 21 GiB
```
What Ceph did here was create a temp volume clone of the rhel9-parent image and then use that as the parent of the actual VM volume. The relationship can be described as: grandparent (rhel9-parent) -> parent (xx-temp) -> child (xx).
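The chain can also be confirmed from the top: `rbd children` lists the clones hanging off a given snapshot (the snapshot name below is taken from the temp image's `parent:` line above):
```
# Clones created from the grandparent's snapshot (should include the -temp image)
rbd children ocs-storagecluster-cephblockpool/csi-vol-6ea3447e-0e75-4dd3-9cba-5fee81a7b0b2@cca6b09b-4808-468c-9df0-cd2f0ad2a64d
```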

The actual data size of each image is identical, which makes me think both are full clones of the grandparent image:
```
sh-5.1$ rbd diff ocs-storagecluster-cephblockpool/csi-vol-73f44909-a673-4300-bf02-97c044b51fa0 | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
4377.25 MB
sh-5.1$ rbd diff ocs-storagecluster-cephblockpool/csi-vol-6ea3447e-0e75-4dd3-9cba-5fee81a7b0b2 | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
4377.25 MB
sh-5.1$ rbd diff ocs-storagecluster-cephblockpool/csi-vol-73f44909-a673-4300-bf02-97c044b51fa0-temp | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
4377.25 MB
```
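For another view of how much data each image actually holds, `rbd du` reports provisioned versus used space per image (a quick sketch; note the temp image lacks the fast-diff feature, so the calculation can be slower for it):
```
rbd du ocs-storagecluster-cephblockpool/csi-vol-6ea3447e-0e75-4dd3-9cba-5fee81a7b0b2
rbd du ocs-storagecluster-cephblockpool/csi-vol-73f44909-a673-4300-bf02-97c044b51fa0
rbd du ocs-storagecluster-cephblockpool/csi-vol-73f44909-a673-4300-bf02-97c044b51fa0-temp
```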
In summary, when we do a PVC clone, a temp volume clone is created. This seems to work around the limit on snapshots, since subsequent snapshots will be created off this unique temp volume.
## Snapshot Clone
A different method was later introduced to clone from a snapshot like this:
```
source:
snapshot:
namespace: default
name: rhel9-snap
```
So we first need to create a snapshot of the rhel9-parent image. On the Ceph backend, a snapshot image is created with a name like this:
```
csi-snap-a16525c8-8556-4a88-bca4-f6d542bc6cbf
```
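The VolumeSnapshot that backs such a csi-snap image can be created along these lines (a sketch; the VolumeSnapshotClass name is an assumption based on the default ODF RBD class):
```
oc create -f - <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: rhel9-snap
  namespace: default
spec:
  volumeSnapshotClassName: ocs-storagecluster-rbdplugin-snapclass  # assumed class name
  source:
    persistentVolumeClaimName: rhel9-parent
EOF
```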
Looking at the metadata of this snapshot image, we see that its parent is indeed the rhel9-parent image.
```
rbd info ocs-storagecluster-cephblockpool/csi-snap-a16525c8-8556-4a88-bca4-f6d542bc6cbf
rbd image 'csi-snap-a16525c8-8556-4a88-bca4-f6d542bc6cbf':
size 21 GiB in 5376 objects
order 22 (4 MiB objects)
snapshot_count: 1
id: cee6655654cd1
block_name_prefix: rbd_data.cee6655654cd1
format: 2
features: layering, deep-flatten, operations
op_features: clone-parent, clone-child
flags:
create_timestamp: Mon Jun 17 12:20:34 2024
access_timestamp: Mon Jun 17 12:20:34 2024
modify_timestamp: Mon Jun 17 12:20:34 2024
parent: ocs-storagecluster-cephblockpool/csi-vol-6ea3447e-0e75-4dd3-9cba-5fee81a7b0b2@231e26c7-eff9-4f56-879c-bb3503e03f8e
overlap: 21 GiB
```
Also, there is no temp volume anymore, just a child volume of this snapshot:
```
rbd info ocs-storagecluster-cephblockpool/csi-vol-b589183d-78c6-4379-aa7d-f416adfffe66
rbd image 'csi-vol-b589183d-78c6-4379-aa7d-f416adfffe66':
size 21 GiB in 5376 objects
order 22 (4 MiB objects)
snapshot_count: 0
id: 14f13b979213bd
block_name_prefix: rbd_data.14f13b979213bd
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, operations
op_features: clone-child
flags:
create_timestamp: Mon Jun 17 12:28:21 2024
access_timestamp: Mon Jun 17 12:28:21 2024
modify_timestamp: Mon Jun 17 12:28:21 2024
parent: ocs-storagecluster-cephblockpool/csi-snap-a16525c8-8556-4a88-bca4-f6d542bc6cbf@csi-snap-a16525c8-8556-4a88-bca4-f6d542bc6cbf
overlap: 21 GiB
```
Every subsequent volume clone will point back to the same snapshot image.
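This is easy to verify from the Ceph side: listing the children of that snapshot should show every cloned volume (names taken from the `rbd info` output above):
```
rbd children ocs-storagecluster-cephblockpool/csi-snap-a16525c8-8556-4a88-bca4-f6d542bc6cbf@csi-snap-a16525c8-8556-4a88-bca4-f6d542bc6cbf
```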
