feature: zvol as block volume without attaching to a pod #502
Comments
Hi @aep - I run Prod Mgmt for the OpenEBS team. You seem to have a strong preference for ZFS as the underlying kernel-mode block storage layer, and you do not want SPDK as the block allocator/LVol/LVstore layer. I'd like to understand why you want an FS underneath.
I know both technology stacks and their Storage Data/Mgmt capabilities very well, so feel free to be as technical as you need to be. Thanks. |
hey @orville-wright on a high level, SPDK is fairly young and unproven, while ZFS has decades of proven stability. Neither maya nor SPDK itself has the necessary tooling yet, like online snapshots, send/recv for offsite backup and recovery, encryption, bitrot protection. Even if they were done tomorrow, they need to be proven first. We're keeping an eye on maya in parallel. It's clearly the future, just not yet. We currently run a classic multi-path SAS cluster with active-passive ZFS, but it needs to scale out and be replaced with NVMe. ceph is not a match for performance reasons. Specifically, the architecture currently planned is to have zfs on each node, then expose 3 zvols per volume over NVMe-oF (we have a 400G RDMA fabric), mirroring them with mdraid or something similar on the node that is currently accessing it. |
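A rough sketch of the two ends of that plan, for readers who want something concrete: it only builds the command lines for creating a zvol on a storage node and for assembling a three-way mdraid mirror on the consuming node, and does not run them. The pool name `tank`, the volume name `vol0`, the size, and the `/dev/nvme*` and `/dev/md0` paths are all assumptions made up for illustration. The NVMe-oF export and connect step in between is left out on purpose.

```python
import shlex


def zvol_create_cmd(pool, name, size):
    """Build the command a storage node would run to create a zvol."""
    # `zfs create -V <size> <pool>/<name>` creates a volume (block device)
    # rather than a filesystem dataset.
    return ["zfs", "create", "-V", size, f"{pool}/{name}"]


def mirror_create_cmd(md_device, members):
    """Build the command the consuming node would run to mirror the devices."""
    # RAID level 1 across every member device that was imported over the fabric.
    return [
        "mdadm", "--create", md_device,
        "--level=1",
        f"--raid-devices={len(members)}",
        *members,
    ]


# Hypothetical values: none of these names come from the thread.
zvol_cmd = zvol_create_cmd("tank", "vol0", "100G")
mirror_cmd = mirror_create_cmd(
    "/dev/md0", ["/dev/nvme1n1", "/dev/nvme2n1", "/dev/nvme3n1"]
)

if __name__ == "__main__":
    # Print the commands instead of executing them.
    print(shlex.join(zvol_cmd))
    print(shlex.join(mirror_cmd))
```

Keeping the commands as argument lists rather than strings makes them easy to hand to `subprocess.run` later without shell quoting issues.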
As you probably know, SPDK is a very modern but complex storage/Data Mgmt tech stack that is designed to run in userspace. Additionally, there are other companies that actively support it and fund large engineering teams to work on it, such as Nvidia (which acquired Mellanox and inherited their NVMe & RDMA SPDK storage software team), Nutanix, Microsoft, Samsung, SUSE, Oracle, RedHat and IBM. So there are about 100+ hardcore storage engineers from hardcore storage companies actively directing & developing SPDK. That's pretty impressive IMHO. Yes, the ZFS community is more mature than SPDK's, as ZFS has been around for longer. ZFS was first released in 2006, whereas SPDK was first released in 2013. Re: Tooling. We are currently adding the ability for users to choose what type of backend 'Block Datastore' they want to manage storage from. The new options will be...
LVM mode is being coded at the moment and is nearing completion. ZFS will come soon after that (in a few months). We're doing this to allow users like yourself to choose which Block Mgmt back-end you want to use (whichever one you are more comfortable with). Some folks like SPDK, others want ZFS and some prefer LVM. We have around 70,000 users that have deployed our current ZFS-LocalPV stack as of today. So it's been battle tested in PROD, and well used. We're comfortable integrating that code into our Mayastor Nexus Fabric (not a big eng task for us). Our LVM stack has about 30,000 users that have deployed it globally as LVM-LocalPV. LVM isn't as popular as ZFS, but it's older and more mature and slightly more I/O optimized in kernel scsi-layer performance (RAID & md layers). We see that very conservative users have a slight preference for LVM, along with users that can't/don't want to enable ZFS in their distro build. Hope this helps |
wait what, how did i miss that? that's amazing. |
Well... you didn't miss it. This is all new Mayastor functionality and comes under our new DiskPool concept... Today a Mayastor DiskPool can only have SPDK LVols/BDev devices as its backend storage media. We also want to integrate ZFS ZPools into Mayastor DiskPools.
LVM is easier as it is more mature and in every Linux kernel. The ZFS option is more complicated, as we're not sure how many Linux distros include ZFS in the kernel by default, or allow the user to install ZFS in kernel mode. (Our preference is kernel-mode ZFS and not user-mode ZFS, because user-mode ZFS is known to be slow and resource hungry.) We're still making some final decisions on this, but @tiagolobocastro's LVM project is helping us to figure things out. |
All major distros have zfs packages but it will of course never be as well adopted as in-tree options. btrfs and bcachefs are both intending to replace zfs with in-tree options. We are keeping a close eye on bcachefs development. Personally I feel like LVM offers no benefits over SPDK. It's essentially the same design from an operational perspective. But I understand people may prefer it over SPDK due to familiarity and existing tooling. It also might actually be more power efficient for low utilization scenarios? In our testing, the in-kernel NVMe-oF target performed marginally better than SPDK, but that might just be lack of tuning. We have low familiarity with the Maya code base, but that's a matter of investment, which will start this summer. We're mostly doing golang, but we'll handle rust just fine I hope. Since we're already heavily invested in zfs, I'd love to become a major contributor specifically in that area. My guess is that once the LVM part is done, it's just a matter of adapting it to zfs. In roughly 2 weeks I hope I can find some time to dig deeper into Maya. Would be great to have some pointers into the diskpool architecture. Or we can wait for it to be done. Not really much time pressure here. |
I'll chat with @tiagolobocastro and @avishnu about the schedule for starting the ZFS work. BTW... Tiago pinged me last night to say that he finished the Phase-1 integration coding for LVM and it's now live in Mayastor!!
It gives us a good feel for how heavy the work is to enhance the DiskPool with new storage mgmt back-ends and expose the features of LVM and ZFS through Mayastor. We will start the internal eval work on the ZFS code. There are some additional issues we have to answer surrounding ZFS... and a few key tech issues that are somewhat deep down in the stack regarding CPU + Mem + Polling/Interrupt resource mgmt for volumes that are Local-PV (not replicated) and not SPDK managed (ZFS or LVM managed). That stuff requires very high familiarity with the Mayastor code and the low-level architecture. We should have some decisions & direction on ZFS within the next 2 weeks. BTW... are you attending KubeCon Paris? Our team is. |
@aep will it work if you specify volume mode as 'block' in the PVC and use the PVC from your custom daemon-set? |
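A minimal sketch of what that suggestion could look like, assuming a StorageClass backed by zfs-localpv already exists. It builds a PVC with `volumeMode: Block` and a pod template that consumes it through `volumeDevices` (a raw device node) instead of `volumeMounts` (a mounted filesystem), then prints the PVC as JSON, which `kubectl apply -f -` accepts. The StorageClass name `zfspv-block`, the PVC name `raw-zvol`, the container image, and the device path are hypothetical and do not come from the thread.

```python
import json

# Hypothetical name of a StorageClass that uses the zfs-localpv provisioner.
STORAGE_CLASS = "zfspv-block"


def block_pvc(name, size, storage_class=STORAGE_CLASS):
    """PVC that requests a raw block device rather than a filesystem."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "storageClassName": storage_class,
            "volumeMode": "Block",  # the field the comment above refers to
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }


def pod_template(pvc_name, device_path="/dev/zvol-raw"):
    """Pod template fragment, e.g. for embedding in a custom workload spec."""
    return {
        "spec": {
            "containers": [{
                "name": "storage-agent",                          # hypothetical
                "image": "example.invalid/storage-agent:latest",  # hypothetical
                # Block-mode volumes are attached as devices, not mounted.
                "volumeDevices": [{"name": "raw", "devicePath": device_path}],
            }],
            "volumes": [{
                "name": "raw",
                "persistentVolumeClaim": {"claimName": pvc_name},
            }],
        }
    }


pvc = block_pvc("raw-zvol", "10Gi")
template = pod_template("raw-zvol")

if __name__ == "__main__":
    print(json.dumps(pvc, indent=2))
```

Whether one claim per node or one per volume fits the custom daemonset is a design decision this sketch does not settle; it only shows the block-mode wiring.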
Describe the problem/challenge you have
we're building a multi node replicated storage system on top of zfs.
ideally, we'd be able to just reuse zfs-localpv for the zfs part,
since it already works well and we'd otherwise just be redoing all of that work.
Describe the solution you'd like
the simple idea is that zfs-localpv would create a zvol with no filesystem on top,
and our custom daemonset takes over from there.
that should probably just be a few lines of code changes to zfs-localpv, which i will gladly figure out myself and open a PR
if the feature is acceptable.
or maybe it's already possible and i just don't understand how to force creating the actual volume without attaching it to a pod.
can i just create a PersistentVolume directly? should i just attach all PVCs to my custom daemonset?
or is there another way to coerce zfs-localpv into directly creating the zvols?
the second, much grander idea would be to integrate the feature directly into zfs-localpv somehow,
but i'm not sure if there's much room in openebs, given that maya is probably the same product (except we want zfs underneath, not spdk).