This repository has been archived by the owner on Oct 31, 2019. It is now read-only.

Question about updating existing cluster vs creating new one in new vcn #182

Open
jferr opened this issue Mar 28, 2018 · 8 comments

@jferr

jferr commented Mar 28, 2018

Terraform Version

Latest and greatest from dockerhub hashicorp/terraform
See https://github.com/jferr/oci_tf_kuber/blob/master/dockerfile

OCI Provider Version

2.0.7

Terraform Installer for Kubernetes Version

Latest and greatest. I've got my docker container which runs terraform cloning the repo at docker build time and running "git pull --prune" at runtime to get the latest master.

FYI...I'm not a contributor to the project/a "go" developer so looking at the code is not my preferred first option here. We are a paying customer of Oracle OCI.

We are refreshing state w. each run (see https://github.com/jferr/oci_tf_kuber/blob/master/doit.sh).

If I manually delete all compute/load balancers/subnets/vcn's in my compartment then run terraform, it appears to still "see" these deleted entities. For example, the screenshot below is partial output from a run immediately after a deletion. The log shows entities which no longer existed in my compartment because I had manually deleted them a few minutes before the run.

[screenshot of partial Terraform output; image not preserved]

I've been working w. oracle support on many related issues with limited success and with a very slow response time so I am moving here for now. Is this the correct avenue for such questions or is there a better place?

The issue above, along with one where we sometimes end up with an un-deletable subnet after a run, makes troubleshooting difficult.

A second question. What does this terraform provider use to determine whether to create a new VCN and create a new cluster or to update an existing cluster in an existing vcn in the compartment?

I'm seeing both things happen. Usually if I create the cluster then re-run terraform (this is done via docker so it's a brand new container doing a refresh of state) it is smart enough to see that it's an existing cluster which matches my desired variables...but sometimes it'll just create a new VCN and a new cluster.

Thanks in advance

@jferr
Author

jferr commented Mar 28, 2018

In general I've been seeing so many issues that it's hard for me to pin them down. I assume that I should be able to update the number of workers in the ADs all day long (add or reduce the number), run terraform, and have everything work, but that's not the case. For example, after 4 successful runs in a row where I changed the number of worker nodes in each AD between runs, I just went from 1 worker in each of three ADs to 0 in one of them and I got this error.

[screenshot of the error; image not preserved]

My intent is to have 200 different tfvars files w different settings and copy each in turn, run terraform and not see a single error. Is this reasonable? I won't feel comfortable using oracle cloud/terraform in production until I can run terraform over and over w. different configs w/o a single error.

@owainlewis
Member

owainlewis commented Apr 3, 2018

Hi @jferr,

Firstly, sorry if you've had any slow responses.

Terraform stores all state about what it's created in a terraform.tfstate file. Generally this needs to be persisted and used for each run with the same resources so that Terraform can keep track of what it has created (https://www.terraform.io/docs/state/). We notice that this isn't being persisted between runs which at a first guess is likely the cause of many of these issues.

Manually deleting and editing things outside of Terraform will likely cause problems. The un-deletable subnet issue happens when a subnet is referenced by another resource and so it is technically unsafe to delete it. This is most likely caused by the manual deleting and editing of things outside of Terraform's control. In short, if the state file and the state of the world are inconsistent for consecutive runs then it's likely Terraform will get confused. This is generally more a property of Terraform itself than the OCI specific implementation.

The Terraform refresh command is used to reconcile the state Terraform knows about (via its state file) with the real-world infrastructure, so the state file is needed for Terraform to do the right thing.
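To make the point about state concrete, here is a toy Python model of the behaviour described above. It is only a sketch and not Terraform's or the OCI provider's actual code: the names `Cloud`, `refresh`, and `apply` are invented for illustration, and real Terraform does far more. It assumes just two things stated in this thread: refresh only reconciles resources the state file already tracks, and anything configured but not tracked gets created.

```python
class Cloud:
    """Hypothetical stand-in for the real infrastructure in a compartment."""

    def __init__(self):
        self.resources = {}  # resource id -> kind
        self._counter = 0

    def create(self, kind):
        self._counter += 1
        rid = f"{kind}-{self._counter}"
        self.resources[rid] = kind
        return rid


def refresh(state, cloud):
    """Drop tracked entries whose real resource no longer exists.

    Note that it never discovers resources the state does not already track.
    """
    return {name: rid for name, rid in state.items() if rid in cloud.resources}


def apply(config, state, cloud):
    """Refresh, then create every configured resource the state does not track."""
    state = refresh(state, cloud)
    for name, kind in config.items():
        if name not in state:
            state[name] = cloud.create(kind)
    return state


config = {"cluster_vcn": "vcn"}
cloud = Cloud()

state = apply(config, {}, cloud)      # first run: creates the VCN
state = apply(config, state, cloud)   # state persisted: nothing new is created
count_with_state = len(cloud.resources)

apply(config, {}, cloud)              # state lost: a second VCN is created
count_without_state = len(cloud.resources)

print(count_with_state, count_without_state)
```

In this model, a run that starts from an empty state has no way of knowing that the existing VCN belongs to the same configuration, which is consistent with the "sometimes it creates a new VCN and cluster" behaviour reported above.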

Let us know if that helps and please ask if you have any further questions on this.

@jferr
Author

jferr commented Apr 3, 2018

Thanks @owainlewis

My thought was that refreshing w. each run is what will allow us to run this via a docker container and should also allow terraform to see an accurate view of infrastructure as-is (e.g. even if something is modified via the OCI console terraform will accurately see "as is" and will reconcile). Is this not true? In our case we are a small group and we can manage a single terraform run at a time by only running terraform via a Jenkins job.

@jferr
Author

jferr commented Apr 5, 2018

@owainlewis shouldn't this work? I started w. using an s3 backend for terraform but had a number of issues plus we've had issues where at some point terraform starts failing and we need to delete everything from the oracle console (compute/lb/vcn) and start over. I figure refreshing each time should be the most stable though less performant. Stability is the important thing here for us.

@owainlewis
Member

Hi @jferr

I think for stability you'll want to

  1. Ensure that resources created by Terraform are (as much as possible) only managed by Terraform (i.e., manually deleting things might cause problems)
  2. Persist the terraform.tfstate throughout the lifecycle of the cluster you are managing. This could be by mounting the state file on the host somewhere if running this in Docker.
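As a rough sketch of the second point, a Jenkins job could build the `docker run` command so that a host directory holding the state file is bind-mounted into the container. Everything specific here is an assumption for illustration: the image name `example/oci-tf-kuber`, the `/state` mount point, and the `./tfstate` host directory are hypothetical, the image's entrypoint is assumed to be `terraform` (as in the official hashicorp/terraform image), and the `-state=` flag only applies when Terraform uses local state rather than a remote backend such as S3.

```python
import shlex
from pathlib import Path


def terraform_docker_argv(state_dir, image="example/oci-tf-kuber", subcommand="apply"):
    """Build a docker command line that keeps terraform.tfstate on the host.

    state_dir: host directory that survives between container runs.
    image:     hypothetical image name; assumed to use `terraform` as entrypoint.
    """
    host_dir = Path(state_dir).resolve()
    return [
        "docker", "run", "--rm",
        # Bind-mount the host directory so the state outlives the container.
        "-v", f"{host_dir}:/state",
        image,
        subcommand,
        # Read and write local state at the mounted path.
        "-state=/state/terraform.tfstate",
    ]


argv = terraform_docker_argv("./tfstate")
print(shlex.join(argv))  # the command is only printed here, not executed
```

The function only assembles the argument list, so it can be inspected or logged before anything is run; passing it to `subprocess.run(argv, check=True)` would be one way to execute it.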

@jferr
Author

jferr commented Apr 6, 2018

Thanks @owainlewis I will try this out. I started that way...with a persistent tfstate file...I tried both locally and Amazon S3-backed...but I still had lots of issues and failures which required me to manually delete resources via the console.

It seems to me that the most stable should be having terraform read the state w. every run w/o persisting the tfstate between runs...though for some organizations this might not be practical. Can you explain why this wouldn't be the most stable way to go?

@jferr
Author

jferr commented Apr 6, 2018

@owainlewis in my testing I often saw that when terraform was refreshing right after I deleted resources via the console (because of the inability to get a successful followup terraform run) I would see references to entities (e.g. subnets) that had already been deleted. Is that the reason why? Is there some bug on Oracle's side where perhaps console changes aren't reflected immediately in the API?

We are very uncomfortable w. Oracle Kubernetes Terraform stability at the moment (we are not live yet) and are escalating this via other channels so any info that you can give us is appreciated.

@owainlewis
Member

Hi @jferr

To clarify the "un-deletable" resources issue when you destroy a cluster: this happens because the resource cannot be safely deleted (i.e., someone or something else is using or referencing the resource). This is a deliberate feature.

"I would see references to entities (e.g. subnets) that had already been deleted"

The docs are helpful here when discussing why we need to persist the terraform.tfstate file.

"It is often asked if it is possible for Terraform to work without state, or for Terraform to not use state and just inspect cloud resources on every run. This page will help explain why Terraform state is required."

https://www.terraform.io/docs/state/purpose.html
