Don't swallow errors when uninstalling FluxCD #4394

Open
wants to merge 1 commit into
base: main
from

knutgoetz

The functions called by the uninstall command can return errors, which are currently ignored. I think users would prefer to be notified about problems when uninstalling FluxCD.

This was originally brought up by @darkowlzz in the last Bugscrub Meeting.

I wasn't sure how to write an adequate test, so I tested the change by appending an arbitrary error to

var aggregateErr []error

which resulted in

panic: uninstall failed: ► deleting components in flux-system namespace                                        
✔ Deployment/flux-system/helm-controller deleted         
✔ Deployment/flux-system/image-automation-controller deleted 
✔ Deployment/flux-system/image-reflector-controller deleted  
✔ Deployment/flux-system/kustomize-controller deleted    
✔ Deployment/flux-system/notification-controller deleted 
✔ Deployment/flux-system/source-controller deleted       
✔ Service/flux-system/notification-controller deleted    
✔ Service/flux-system/source-controller deleted          
✔ Service/flux-system/webhook-receiver deleted           
✔ NetworkPolicy/flux-system/allow-egress deleted         
✔ NetworkPolicy/flux-system/allow-scraping deleted       
✔ NetworkPolicy/flux-system/allow-webhooks deleted       
✔ ServiceAccount/flux-system/helm-controller deleted     
✔ ServiceAccount/flux-system/image-automation-controller deleted                                               
✔ ServiceAccount/flux-system/image-reflector-controller deleted                                                
✔ ServiceAccount/flux-system/kustomize-controller deleted
✔ ServiceAccount/flux-system/notification-controller deleted 
✔ ServiceAccount/flux-system/source-controller deleted   
✔ ClusterRole/crd-controller-flux-system deleted         
✔ ClusterRole/flux-edit-flux-system deleted              
✔ ClusterRole/flux-view-flux-system deleted              
✔ ClusterRoleBinding/cluster-reconciler-flux-system deleted  
✔ ClusterRoleBinding/crd-controller-flux-system deleted  
► deleting toolkit.fluxcd.io finalizers in all namespaces
 error:'this is a new error'

when running the e2e tests.

@@ -90,16 +90,24 @@ func uninstallCmdRun(cmd *cobra.Command, args []string) error {
 }

 logger.Actionf("deleting components in %s namespace", *kubeconfigArgs.Namespace)
-uninstall.Components(ctx, logger, kubeClient, *kubeconfigArgs.Namespace, uninstallArgs.dryRun)
+if err := uninstall.Components(ctx, logger, kubeClient, *kubeconfigArgs.Namespace, uninstallArgs.dryRun); err != nil {
Contributor

It may be better to ignore not found errors using https://github.com/kubernetes/apimachinery/blob/v0.27.4/pkg/api/errors/errors.go#L527 . Any other errors may be important.

@stefanprodan
Member

stefanprodan commented Nov 10, 2023

Breaking the execution is not something that I would consider doing. People use uninstall on clusters with stuck resources, nodes, etc. We could log and continue.

@knutgoetz
Author

Breaking the execution is not something that I would consider doing. People use uninstall on clusters with stuck resources, nodes, etc. We could log and continue.

Okay, makes sense. To be honest, maybe there is nothing to do then?
As far as I can tell, the errors are already logged in pkg/uninstall. It still looks a bit odd to me to ignore these errors and exit successfully, but maybe this was intended all along and the errors are only relevant somewhere else.

@darkowlzz
Contributor

darkowlzz commented Nov 14, 2023

The notion of this being a bug originates from #4355 (comment), where I was surprised to see the program exit with a success status code. Looking at the underlying code, it seemed clearer that it's a bug, as it ignores every kind of error, including connection errors:

$ flux uninstall 
Are you sure you want to delete Flux and its custom resource definitions: y
► deleting components in flux-system namespace
► deleting toolkit.fluxcd.io finalizers in all namespaces
► deleting toolkit.fluxcd.io custom resource definitions
✗ Namespace/flux-system deletion failed: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get "https://127.0.0.1:34068/api/v1": dial tcp 127.0.0.1:34068: connect: connection refused
✔ uninstall finished

$ echo $?
0

I changed the API server port in kubeconfig to create this scenario.

I think it's less about halting the execution and more about returning an appropriate status code, so that other systems using the CLI know when it failed to do its job. A CI system configured to run against a persistent cluster may assume that the uninstall was successful because of the status code.

I think this can be handled nicely by aggregating the errors returned from each of the uninstall steps, ignoring not found errors. If the aggregate contains any error at the end of uninstallCmdRun(), the function should return it. This way, the execution continues even when an error is encountered, and the program ends with a non-zero status code and the relevant errors.


I just tried what I described above, and I see more issues with this scenario. The output above only contains an error about the namespace, but nothing about the other operations, which should also have failed due to the connection issue. If I pass the flag to keep the namespace:

$ flux uninstall --keep-namespace
Are you sure you want to delete Flux and its custom resource definitions: y
► deleting components in flux-system namespace
► deleting toolkit.fluxcd.io finalizers in all namespaces
► deleting toolkit.fluxcd.io custom resource definitions
✔ uninstall finished

$ echo $?
0

It visually appears as if there's nothing wrong. I still have the altered kubeconfig which should result in a failure.

All the uninstall functions in pkg/uninstall/uninstall.go, except for Namespace(), ignore the error returned when listing the resources. If we continue the execution past every error and collect the errors instead of dropping them, we'll end up with a huge list of errors. To demonstrate what that results in, I tried it:

$ flux uninstall
Are you sure you want to delete Flux and its custom resource definitions: y
► deleting components in flux-system namespace
► deleting toolkit.fluxcd.io finalizers in all namespaces
► deleting toolkit.fluxcd.io custom resource definitions
✗ Namespace/flux-system deletion failed: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get "https://127.0.0.1:34068/api/v1": dial tcp 127.0.0.1:34068: connect: connection refused
✗ [failed to get API group resources: unable to retrieve the complete list of server APIs: apps/v1: Get "https://127.0.0.1:34068/apis/apps/v1": dial tcp 127.0.0.1:34068: connect: connection refused, failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get "https://127.0.0.1:34068/api/v1": dial tcp 127.0.0.1:34068: connect: connection refused, failed to get API group resources: unable to retrieve the complete list of server APIs: networking.k8s.io/v1: Get "https://127.0.0.1:34068/apis/networking.k8s.io/v1": dial tcp 127.0.0.1:34068: connect: connection refused, failed to get API group resources: unable to retrieve the complete list of server APIs: rbac.authorization.k8s.io/v1: Get "https://127.0.0.1:34068/apis/rbac.authorization.k8s.io/v1": dial tcp 127.0.0.1:34068: connect: connection refused, failed to get API group resources: unable to retrieve the complete list of server APIs: source.toolkit.fluxcd.io/v1: Get "https://127.0.0.1:34068/apis/source.toolkit.fluxcd.io/v1": dial tcp 127.0.0.1:34068: connect: connection refused, failed to get API group resources: unable to retrieve the complete list of server APIs: source.toolkit.fluxcd.io/v1beta2: Get "https://127.0.0.1:34068/apis/source.toolkit.fluxcd.io/v1beta2": dial tcp 127.0.0.1:34068: connect: connection refused, failed to get API group resources: unable to retrieve the complete list of server APIs: kustomize.toolkit.fluxcd.io/v1: Get "https://127.0.0.1:34068/apis/kustomize.toolkit.fluxcd.io/v1": dial tcp 127.0.0.1:34068: connect: connection refused, failed to get API group resources: unable to retrieve the complete list of server APIs: helm.toolkit.fluxcd.io/v2beta1: Get "https://127.0.0.1:34068/apis/helm.toolkit.fluxcd.io/v2beta1": dial tcp 127.0.0.1:34068: connect: connection refused, failed to get API group resources: unable to retrieve the complete list of server APIs: notification.toolkit.fluxcd.io/v1beta2: Get 
"https://127.0.0.1:34068/apis/notification.toolkit.fluxcd.io/v1beta2": dial tcp 127.0.0.1:34068: connect: connection refused, failed to get API group resources: unable to retrieve the complete list of server APIs: notification.toolkit.fluxcd.io/v1: Get "https://127.0.0.1:34068/apis/notification.toolkit.fluxcd.io/v1": dial tcp 127.0.0.1:34068: connect: connection refused, failed to get API group resources: unable to retrieve the complete list of server APIs: image.toolkit.fluxcd.io/v1beta2: Get "https://127.0.0.1:34068/apis/image.toolkit.fluxcd.io/v1beta2": dial tcp 127.0.0.1:34068: connect: connection refused, failed to get API group resources: unable to retrieve the complete list of server APIs: image.toolkit.fluxcd.io/v1beta1: Get "https://127.0.0.1:34068/apis/image.toolkit.fluxcd.io/v1beta1": dial tcp 127.0.0.1:34068: connect: connection refused, failed to get API group resources: unable to retrieve the complete list of server APIs: apiextensions.k8s.io/v1: Get "https://127.0.0.1:34068/apis/apiextensions.k8s.io/v1": dial tcp 127.0.0.1:34068: connect: connection refused]

I tried to emphasize the length of the same repeated error we end up with.

Because of this, I'd hesitate to continue past every error and end up with too many copies of the same one. I think it would be much better to intentionally ignore specific errors, such as not found errors (the apimachinery errors package has other helpers to identify timeouts and other error types if needed), and to return from the individual uninstall functions as soon as any other kind of error is encountered. We can continue through the delete functions as long as they only hit ignorable errors, but return immediately as soon as a problematic error comes back. In the ideal case, this retains the current behavior of uninstall, ignoring non-problematic errors like not found. For the worst case, I think we need a flag to force the uninstall or ignore every kind of error.

@darkowlzz
Contributor

darkowlzz commented Nov 23, 2023

@knutgoetz I had a discussion about this with Stefan and we agreed to do the following to address the issues in my previous comment:

  • Update all the list operations that are run during uninstall to handle errors.
  • For the update operations that remove the finalizers on the objects, check the returned error and handle it gracefully when the update is issued against a non-existing object. Any other unexpected error should result in a failure.
  • In the whole uninstall process, ignore any not-found errors. Any other unexpected error should result in an immediate return and an exit with a non-zero exit code and the encountered error.
  • Introduce a force flag in the uninstall command to run uninstall like today, ignoring all the unexpected errors.

We can have more discussions about further details here as we make progress and have new code.
If you prefer, you can divide the work into multiple separate PRs. It's up to you.

3 participants