-
Notifications
You must be signed in to change notification settings - Fork 617
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't swallow errors when uninstalling FluxCD #4394
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Knut Götz <[email protected]>
@@ -90,16 +90,24 @@ func uninstallCmdRun(cmd *cobra.Command, args []string) error { | |||
} | |||
|
|||
logger.Actionf("deleting components in %s namespace", *kubeconfigArgs.Namespace) | |||
uninstall.Components(ctx, logger, kubeClient, *kubeconfigArgs.Namespace, uninstallArgs.dryRun) | |||
if err := uninstall.Components(ctx, logger, kubeClient, *kubeconfigArgs.Namespace, uninstallArgs.dryRun); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be better to ignore not found errors using https://github.com/kubernetes/apimachinery/blob/v0.27.4/pkg/api/errors/errors.go#L527 . Any other errors may be important.
Breaking the execution is not something that I would consider doing. People use uninstall on clusters with stuck resources, nodes, etc. We could log and continue. |
Okay, makes sense. To be honest maybe then there is nothing to do? |
The notion of this being a bug originates from #4355 (comment) where I was surprised to see the program end with success status code. Looking at the underlying code, it seemed more clear that it's a bug as it ignores every kind of error, including connection errors: $ flux uninstall
Are you sure you want to delete Flux and its custom resource definitions: y
► deleting components in flux-system namespace
► deleting toolkit.fluxcd.io finalizers in all namespaces
► deleting toolkit.fluxcd.io custom resource definitions
✗ Namespace/flux-system deletion failed: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get "https://127.0.0.1:34068/api/v1": dial tcp 127.0.0.1:34068: connect: connection refused
✔ uninstall finished
$ echo $?
0 I changed the API server port in kubeconfig to create this scenario. I think it's less about halting the execution and more about returning an appropriate status code to help the other systems that use the CLI to know when it failed to do the job. A CI system, may be configured to run against a persistent cluster, may assume that uninstall was successful because of the status code. I think this can be handled nicely by aggregating the errors returned from each of the uninstall steps, ignoring not found errors. If the aggregated error contains any error at the end of Just tried what I said above and I see more issues around the scenario described above. In the above output, there's only an error about namespace but not about other things that too should have failed due to connection issue. If I pass the flag to keep the namespace: $ flux uninstall --keep-namespace
Are you sure you want to delete Flux and its custom resource definitions: y
► deleting components in flux-system namespace
► deleting toolkit.fluxcd.io finalizers in all namespaces
► deleting toolkit.fluxcd.io custom resource definitions
✔ uninstall finished
$ echo $?
0 It visually appears as if there's nothing wrong. I still have the altered kubeconfig which should result in a failure. All the uninstall functions in pkg/uninstall/uninstall.go, except for
Tried to emphasize the length of the same repeated error we end up with. Because of this, I'd hesitate to ignore error and continue for everything and end up with too many of the same error. I think it'll be much better to intentionally ignore specific errors like not found errors (there are other error check helpers in apimachinery errors package if needed to identify timeout and other types of errors) and return from the individual uninstall functions as soon as any other kind of errors are encountered. We can continue with the other delete functions, but return immediately as soon as a problematic error is returned. In the ideal case scenario, this would retain the same behavior of uninstall, ignoring non-problematic errors like not found. For worse case, I think we need a force or ignore every error kind of flag. |
@knutgoetz I had a discussion about this with Stefan and we agreed to do the following to address the issues in my previous comment:
We can have more discussions about further details here as we make progress and have new code. |
The functions called via the
uninstall
command might return errors, which are ignored right now. I think users prefer to get noticed about issues, when uninstalling FluxCD.This was originally brought up by @darkowlzz in the last Bugscrub Meeting.
I wasn't sure how to write an adequate test, so i tested the change adding some random error to
flux2/pkg/uninstall/uninstall.go
Line 138 in 3c8072d
which resulted in
when running the
e2e
tests.