
Default ABD chunk size
Abandoned, Public

Authored by seanc on Sep 16 2017, 11:40 PM.

Details

Summary

The default ABD chunk size is wasteful in common production workloads. This value should probably end up as a loader.conf tunable eventually, but in the interim can we reduce it back to the originally submitted value of 1024?

As near as we could tell at work, the only discussion of bumping the value from 1K to 4K is here, and the change was made without any real-world measurement: https://github.com/openzfs/openzfs/pull/326#issuecomment-291223116

A bugzilla issue has been opened up, too: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=222377

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

seanc created this revision. Sep 16 2017, 11:40 PM

FYI - we use 1K ABD chunks at Delphix (on illumos), and we've found that up to 40% of memory can end up being wasted because of memory fragmentation. At least on illumos, the 1K ABD kmem cache will use slab size 4K (one page), so there are 4 chunks per slab. After eviction, we may have many partially-full slabs, which wastes the free chunks. We are considering implementing a kmem_move callback which would allow the free chunks to be consolidated, reducing the potential for memory waste. But it isn't done yet.
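The slab arithmetic in this comment can be sketched roughly as follows. This is only an illustrative model, not the illumos kmem or FreeBSD UMA implementation: `slab_waste` and the occupancy list are hypothetical, and the sketch assumes that a slab with at least one live chunk stays fully resident while a completely empty slab is returned to the system.

```python
def slab_waste(live_chunks_per_slab, chunks_per_slab=4, chunk_size=1024):
    """Return (bytes_held, bytes_live, wasted_fraction) for a set of slabs.

    Assumption: a slab with any live chunk cannot be freed, so its free
    chunks count as waste; fully empty slabs are released and cost nothing.
    """
    held = 0
    live = 0
    for n in live_chunks_per_slab:
        if not 0 <= n <= chunks_per_slab:
            raise ValueError("occupancy out of range for slab")
        if n == 0:
            continue  # empty slab is assumed to be returned to the system
        held += chunks_per_slab * chunk_size
        live += n * chunk_size
    wasted = (held - live) / held if held else 0.0
    return held, live, wasted


# Made-up post-eviction state: ten 4K slabs of 1K chunks, many partially full.
occupancy = [4, 1, 2, 0, 3, 1, 4, 2, 0, 1]
held, live, wasted = slab_waste(occupancy)
print(f"held={held} live={live} wasted={wasted:.1%}")
```

The occupancy values are invented for illustration and are not measurements; the point is only that partially full slabs keep whole pages resident for a few live chunks.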

mav added a comment. Sep 18 2017, 1:46 AM (edited)

I'll second @mahrens's worry about possible memory fragmentation. Plus, the FreeBSD kernel memory allocator does not even have the mentioned hooks for relocating memory to fight it. Plus, I still hope we can make the GEOM vdev use unmapped I/O to avoid extra buffer copying, which depends on blocks being a multiple of the page size (4KB). I am not against making it a loader tunable, but I am not ready to change the default or promote the tunable.

seanc added a comment. Sep 18 2017, 5:47 PM

@mav I'm less concerned about fragmentation due to a changing workload. I'm much more concerned about the steady-state waste that ABD represents for smaller record sizes. Let me quote a recent issue where we investigated memory usage:

In our Manta environment, we have seen high ABD waste: a relatively typical example on a PostgreSQL database had 27 GiB of ABD waste on a 256 GiB system where the ARC size was ~100 GiB. For us, the zfs_abd_chunk_size value of 4K is inducing this waste, because our record size (8K) and our ARC compression ratio (~3.31X) yield an average ABD size of ~2500 bytes (well below the ABD chunk size). Ironically, this is nearly exactly the experience of the original ABD implementers, who, for essentially identical reasons, selected a 1K ABD chunk size.

Fragmentation is a separate issue compared to the outright waste incurred with a compressed ARC and small-enough record sizes.
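To make the steady-state argument concrete, here is a minimal sketch of the rounding arithmetic using the figures quoted above (8K records, ~3.31X compression). It assumes each buffer occupies a whole number of chunks and ignores every other source of overhead; `abd_rounding_waste` is a hypothetical helper, not a ZFS function, and the result is not meant to reproduce the 27 GiB figure.

```python
def abd_rounding_waste(payload_bytes, chunk_size):
    """Return (allocated_bytes, wasted_bytes) when a payload is rounded
    up to a whole number of chunks of the given size."""
    chunks = -(-payload_bytes // chunk_size)  # ceiling division
    allocated = chunks * chunk_size
    return allocated, allocated - payload_bytes


record_size = 8 * 1024      # 8K record size from the comment
compression_ratio = 3.31    # ARC compression ratio from the comment
payload = round(record_size / compression_ratio)  # average compressed size

for chunk_size in (1024, 4096):
    allocated, waste = abd_rounding_waste(payload, chunk_size)
    print(f"chunk={chunk_size}: payload={payload} "
          f"allocated={allocated} waste={waste} ({waste / allocated:.1%})")
```

Under these assumptions an average buffer of about 2475 bytes rounds up to 3072 bytes with 1K chunks and to 4096 bytes with 4K chunks, which is the per-buffer difference this comment is pointing at.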

mav added a comment. Sep 19 2017, 10:42 PM (edited)

Fragmentation is a separate issue compared to the outright waste incurred with a compressed ARC and small-enough record sizes.

It's a price, and it is not completely negligible. I am not saying that there is no problem, but it is not an obvious one-line change. If you want a simple one-line change, make it an RD_TUN sysctl. I'll probably do so as soon as I have free time.

seanc added a comment. Sep 20 2017, 2:42 PM (edited)

avg@ added the loader tunable to make this adjustable in https://svnweb.freebsd.org/base?view=revision&revision=323797; however, I have every reason to believe (and measurements from production to show) that a 4K default is going to result in more waste than a 1K default. A 1K default may result in fragmentation with 4K slabs, but there is less waste for everyone with 1K than with 4K.

@mahrens , do you think OpenZFS will go back to a 1K default?

@seanc OpenZFS/illumos has been at 4K chunk size since ABD was introduced. I'm not aware of any discussion around changing that.

avg added a subscriber: avg. Oct 1 2017, 3:09 PM

avg@ added the loader tunable to make this adjustable in https://svnweb.freebsd.org/base?view=revision&revision=323797 however I have every reason to believe (and measurements from production) that 4K is going to result in more waste than a 1K default. A 1K default may result in fragmentation with 4K slabs but there is less waste for everyone with 1K than 4K.
@mahrens , do you think OpenZFS will go back to a 1K default?

Since you have compression on, and it is being so effective, have you considered using a larger record size? You are not going to have write amplification, so the usual reason for using an 8k record doesn't apply. If your compression is averaging 3.31x @ 8k records, I would expect an even higher number at 16k records, and that would end up with most chunks being just a bit less than 4k, reducing ABD waste, increasing compression and throughput, and reducing disk slack (unless you are using 512b disks, but I assume you have SSDs for the database). With SSDs having a larger write size anyway, there may be value in using an even larger record size. With the Postgres parameter full_page_writes turned off, you should see less write amplification as well.
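The 16k suggestion can be checked with the same kind of rounding arithmetic. The sketch below assumes whole-chunk rounding and a 4K chunk size; `chunks_for_record` is a hypothetical helper, and the 4.0x and 4.5x ratios are made-up values standing in for the "even higher number" expected at 16k, not measurements.

```python
def chunks_for_record(record_size, compression_ratio, chunk_size=4096):
    """Return (payload_bytes, chunks, wasted_bytes) for one compressed
    record rounded up to whole chunks."""
    payload = round(record_size / compression_ratio)
    chunks = -(-payload // chunk_size)  # ceiling division
    return payload, chunks, chunks * chunk_size - payload


record_size = 16 * 1024
# 3.31 is the ratio reported for 8k records; 4.0 and 4.5 are hypothetical.
for ratio in (3.31, 4.0, 4.5):
    payload, chunks, waste = chunks_for_record(record_size, ratio)
    print(f"ratio={ratio}: payload={payload} chunks={chunks} waste={waste}")
```

In this model a 16k record only fits in a single 4K chunk once the ratio reaches 4x; at an unchanged 3.31x it spills into a second chunk. Whether the ratio actually improves at 16k depends on the workload.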

seanc abandoned this revision. Jan 22 2018, 6:59 AM

@allanjude / @mahrens, it's worth pointing out that we eventually abandoned this change and went back to a 4K ABD chunk size. So while 1K may have been more memory efficient in the short term, it ended up being suboptimal in the long run. I'm abandoning this issue and hoping no one has to relearn our lessons.

https://github.com/joyent/illumos-joyent/commit/2bd6ca8c3cc70becca5f99bbf557b70ac3dfdaf7
https://smartos.org/bugview/OS-6387
https://smartos.org/bugview/OS-6363