Why are bucket contents not cached in the bucket-controller? #5131
-
I have gone through bucket_controller.go and figured out that we are already building an index of the bucket contents along with their etags. We check whether the computed digest matches the digest stored in the Bucket CR's status; if not, we iterate over the index and download the object content for each file. Are there any known challenges with caching bucket contents? Could we integrate the cache package into the bucket controller so that we download only the files whose etag differs from the one recorded in the cache? I am more than happy to contribute towards this enhancement.
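To illustrate the proposal: a minimal Go sketch of an etag cache keyed by object key, where only objects whose etag differs from the cached one are re-downloaded. The names (`etagCache`, `objectsToFetch`) are hypothetical and not source-controller APIs.

```go
package main

import (
	"fmt"
	"sort"
)

// etagCache maps object keys to the etag recorded at the last reconciliation.
// Hypothetical type, for illustration only.
type etagCache map[string]string

// objectsToFetch returns only the keys whose etag differs from (or is missing
// from) the cache, so unchanged objects would not need to be re-downloaded.
func objectsToFetch(index map[string]string, cache etagCache) []string {
	var stale []string
	for key, etag := range index {
		if cached, ok := cache[key]; !ok || cached != etag {
			stale = append(stale, key)
		}
	}
	sort.Strings(stale) // deterministic order for the caller
	return stale
}

func main() {
	cache := etagCache{"app.yaml": "etag-1", "cfg.yaml": "etag-2"}
	index := map[string]string{
		"app.yaml": "etag-1", // unchanged: skip
		"cfg.yaml": "etag-3", // etag changed: fetch
		"new.yaml": "etag-4", // not in cache: fetch
	}
	fmt.Println(objectsToFetch(index, cache)) // → [cfg.yaml new.yaml]
}
```

With such a cache only the changed or new objects would hit the object store on each reconciliation, rather than every file in the index.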
Replies: 3 comments 10 replies
-
Have you considered using the OCIRepository source instead of Bucket? It is much safer and recommended: you can sign OCI artifacts using cosign and have Flux verify the signature in-cluster right before consuming the artifact. I'd personally avoid buckets as much as possible.
-
Our use case is slightly different. We have an API layer dumping YAML in an S3 bucket, and we are utilising Flux to deploy these YAMLs to various clusters at runtime.
-
All Flux sources work the same: if the computed digest changes (for Git this is the head SHA, for OCI this is the artifact digest), we fetch the whole content. Tracking each file and its checksum is not something we want to do, as it would mean storing a huge amount of information in the status, and thus in etcd, which is limited to 1 MiB.
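The single-digest approach described above can be sketched as hashing the sorted key/etag pairs of the index, so that any object change flips the revision without the status having to track per-file checksums. This is an illustrative sketch, not the exact source-controller implementation.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// revisionDigest summarizes an entire object index as one digest:
// hash the sorted key/etag pairs, so a change to any single object
// changes the revision. Illustrative only.
func revisionDigest(index map[string]string) string {
	keys := make([]string, 0, len(index))
	for k := range index {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random; sort for determinism
	h := sha256.New()
	for _, k := range keys {
		fmt.Fprintf(h, "%s:%s\n", k, index[k])
	}
	return fmt.Sprintf("sha256:%x", h.Sum(nil))
}

func main() {
	before := map[string]string{"app.yaml": "etag-1", "cfg.yaml": "etag-2"}
	after := map[string]string{"app.yaml": "etag-1", "cfg.yaml": "etag-3"}
	// One etag changed, so the whole revision changes and a fetch is triggered.
	fmt.Println(revisionDigest(before) != revisionDigest(after)) // → true
}
```

The status only ever stores this one digest, which is why the full per-file index never needs to live in etcd.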
@kgoutham93 If multi-tenancy is a factor here, then having a dedicated OCIRepository per tenant makes way more sense than storing everything in a single bucket, IMO. You could use the same registry and have some convention for tags; encoding the tenant name/ID would suffice. Instead of an S3 PUT you would run `flux push artifact registry/manifests:tenant1`.