Page MenuHomeFreeBSD

Make MAXPHYS tunable.
ClosedPublic

Authored by kib on Nov 15 2020, 10:53 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Dec 21, 8:16 AM
Unknown Object (File)
Thu, Dec 19, 7:17 PM
Unknown Object (File)
Thu, Dec 19, 7:12 PM
Unknown Object (File)
Thu, Dec 19, 7:11 PM
Unknown Object (File)
Thu, Dec 19, 7:03 PM
Unknown Object (File)
Thu, Dec 19, 7:02 PM
Unknown Object (File)
Thu, Dec 19, 7:02 PM
Unknown Object (File)
Thu, Dec 19, 3:58 PM

Details

Reviewers
slm
mav
imp
mckusick
scottl
Group Reviewers
cam
Restricted Owners Package(Owns No Changed Paths)
Commits
rS368124: Make MAXPHYS tunable. Bump MAXPHYS to 1M.
Summary

Replace MAXPHYS by runtime variable maxphys. It is initialized from MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.

Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer cache buffers exactly to atop(maxbcachebuf) (currently it is sized to atop(MAXPHYS)), and b_pages[] for pbufs sized to atop(maxphys) + 1. +1 for pbufs allow several pbuf consumers, among them vmapbuf(), to use unaligned buffers still sized to maxphys, esp. when such buffers come from userspace (*). Overall, we save significant amount of otherwise wasted memory in b_pages[] for buffer cache buffers, while bumping MAXPHYS to desired high value.

Eliminate all direct uses of the MAXPHYS constant in kernel and driver sources, except a place which initialize maxphys. Some random (and arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted straight.

Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs, dev/siis, where either submitted by, or based on changes by mav.

Suggested by: mav (*)

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 34820

Event Timeline

kib requested review of this revision.Nov 15 2020, 10:53 AM

This is WIP. The posted snapshot of the patch booted multiuser with root over NFS, and also survived UFS mount from ciss(4).

TODO:

  • drivers;
  • remove MAXPHYS from opt_global.h once all drivers are fixed.

This looks great kib! It will be very nice to have this feature.

sizeof(bp)->sizeof(*bp)

Does ZFS need any tweaking? I know it's mildly wrong today in scheduling 1MB I/Os today, but at least GEOM will break that up so the I/Os from there on down will be legal.

I like this a lot more than my just bumping the constant. Thank you for taking the time to make this so much better.

In D27225#607878, @imp wrote:

Does ZFS need any tweaking? I know it's mildly wrong today in scheduling 1MB I/Os today, but at least GEOM will break that up so the I/Os from there on down will be legal.

No, it doesn't, because on other platforms large I/Os are not a problem for a long time, it was only our pain. As I have told, it already supports 1MB blocks (still not by default, but widely advertized) and does I/O aggregation up to 1MB for HDDs. For SSDs I've previously reduced the aggregation to 128KB due to previous MAXPHYS, but I still don't think it makes sense to aggregate more for SSDs, at least until it is reimplemented there without huge extra memcpy().

In D27225#607879, @mav wrote:
In D27225#607878, @imp wrote:

Does ZFS need any tweaking? I know it's mildly wrong today in scheduling 1MB I/Os today, but at least GEOM will break that up so the I/Os from there on down will be legal.

No, it doesn't, because on other platforms large I/Os are not a problem for a long time, it was only our pain. As I have told, it already supports 1MB blocks (still not by default, but widely advertized) and does I/O aggregation up to 1MB for HDDs.

Sounds like it's doing the right thing then. this clears up a point you made that I didn't quite understand as well, so I'm back with you. Thanks.

For SSDs I've previously reduced the aggregation to 128KB due to previous MAXPHYS, but I still don't think it makes sense to aggregate more for SSDs, at least until it is reimplemented there without huge extra memcpy().

The only ones it would make sense on are the ones whose physical block size is 256k. I've seen a few of those as samples from different vendors, though I don't know if they are in the market today... Though removing the extra memcpy would be good (but removing needless memcpy is always good, right?)...

sys/kern/subr_param.c
106

Should we mark those as __read_mostly?

kib marked an inline comment as done.Nov 16 2020, 10:22 AM
kib added inline comments.
sys/kern/subr_param.c
106

The maxphys var is mostly unused after the patch, its consumers are driver initialization code.

Also I am not a fan of jealous application of micro-optimizations without proof of importance.

kib marked an inline comment as done.

Apply mav' patch.
Still more drivers and geoms to handle.

Handle rest of MAXPHYS uses.
Remove MAXPHYS from opt_global.h.

This should be complete patch, modulo bugs.

It looks great! Thanks!

Just one more thought (though it may go separately): could we increase pbufs by one more page to be able map to KVA misaligned maxphys blocks? It would make physio and CAM pass APIs much nicer to user. I've had complaints from some tape software developers that they can not guarantee buffer alignment and has to reduce buffer in half because of that. Or it is too big KVA waste?

This revision is now accepted and ready to land.Nov 16 2020, 2:42 PM
In D27225#608116, @mav wrote:

It looks great! Thanks!

Just one more thought (though it may go separately): could we increase pbufs by one more page to be able map to KVA misaligned maxphys blocks? It would make physio and CAM pass APIs much nicer to user. I've had complaints from some tape software developers that they can not guarantee buffer alignment and has to reduce buffer in half because of that. Or it is too big KVA waste?

I would worry about reserved KVA misalignment then. Right now, I believe, pbuf' allocated KVA is naturally aligned.

In fact, is one page enough ? Shouldn't it be two pages of slack to be useful to map unaligned maxphys-length buffer ?

In D27225#608119, @kib wrote:

I would worry about reserved KVA misalignment then. Right now, I believe, pbuf' allocated KVA is naturally aligned.

Does the alignment help anyhow there? Superpages or something perspective? Or you mean misalignment of something beside pbufs themselves, being pushed forward?

In fact, is one page enough ? Shouldn't it be two pages of slack to be useful to map unaligned maxphys-length buffer ?

In all the code we are adding one extra page. Why/how would we need more than one?

Add one slack page to pbufs to allow unaligned buffers.

This revision now requires review to proceed.Nov 16 2020, 3:21 PM
markj added inline comments.
sys/kern/vfs_aio.c
1262

The + 1 looks like it's in the wrong place, but I do not really understand why it's there to begin with.

sys/kern/vfs_aio.c
1262

Indeed misplaced, while needed to represent misaligned maxphys buffers.

1264

This condition can be relaxed with extra page in pbuf.

sys/sys/param.h
164

If MAXPHYS is left here for some time for some driver compatibility, I think it would be right to include opt_maxphys.h here too. I was surprised how sbp(4) driver builds while still having MAXPHYS, remember that you've moved the option to separate header.

The nvme and CAM changes look good. I've not closely reviewed beyond that but the spot checks seem fine and to be a simple s/MAXPHYS/maxphys/ or moral equivalent with b_pages.

This revision is now accepted and ready to land.Nov 16 2020, 6:43 PM
mckusick added a subscriber: mckusick.

I have wanted this change for a long time. Thanks for doing it.

sys/sys/systm.h
78

Add comment /* max raw I/O transfer size */

kib marked 5 inline comments as done.Nov 17 2020, 12:07 AM
kib added inline comments.
sys/kern/vfs_aio.c
1295

nitems(job->pages) is wrong as well.

sys/sys/param.h
164

It cannot be done this way because sys/param.h is also used by userspace.
In fact I wanted to remove MAXPHYS define at all, at least from common headers, but this is not going to work due to userspace.

For sbp(4), can you please point to the full file path and line where MAXPHYS is left ?

sys/sys/param.h
164

According to my sources I see two:
sys/dev/firewire/sbp.c:74
sys/dev/isci/scil/sci_controller_constants.h:160

sys/sys/param.h
164

Could we use #define MAXPHYS __get_maxphys() for userland? Or does this need to be a compile time constant for userland?

kib marked 3 inline comments as done.Nov 17 2020, 12:33 AM
kib added inline comments.
sys/sys/param.h
164

I see how I missed them, thanks.

164

No we cannot do a trick with __get_maxphys(), it should be constant. I do not see an issue with having it this way. If filesystems or drivers want to inform userspace about proper transfer sizes, there are already channels to communicate that, e.g. statfs(2).

kib marked 2 inline comments as done.

Fix bugs in aio.
Handle sbp and isci.

This revision now requires review to proceed.Nov 17 2020, 12:35 AM
sys/sys/param.h
164

OK. It might make sense, then, to see what we're using it for in userland. Most places a compile-time thing would be fine, but there's at least two places that use MAXPHYS to size an array... There's a number of places that use it to scale buffers for performance reasons, but others that use it to limit kernel requests.... Clearly beyond the scope of this work, though....

Make vmapbuf() use all pbuf pages.

sys/dev/nvme/nvme_ctrlr.c
1258

Would you also remove the above block, since now vmapbuf() should do the right thing?

kib marked 2 inline comments as done.Nov 17 2020, 6:46 PM
kib added inline comments.
sys/dev/nvme/nvme_ctrlr.c
1258

I would prefer you to do it. Perhaps you have a test setup to exercise this code path, neither I nor pho do not.

sys/dev/nvme/nvme_ctrlr.c
1258

Do you mean also in separate commit? OK.

kib marked an inline comment as done.

Ensure that maxphys is power of two.
Add some comments.

sys/kern/subr_param.c
303

Is there a minimum that needs to be enforced?

kib marked an inline comment as done.Nov 18 2020, 4:26 AM
kib added inline comments.
sys/kern/subr_param.c
303

I am not sure, perhaps one page is absolutely required, I am not sure how far can we proceed with it.

I somewhat dislike even the checks I added, I do not see why users should be not allowed to experiment. The cost of mistake there is low, at worst machine won't boot. It is somewhat costly for a user not having access to console, but such user probably ought to not make experiments like that.

This revision is now accepted and ready to land.Nov 22 2020, 4:13 AM
sys/kern/subr_param.c
303

I think a basic check of PAGE_SIZE is a reasonable seat-belt. Otherwise some might think that 512b is a reasonable MAXPHYS, or even 0, both of which is think would result in a system unable to do any disk I/O. It's not like this would be checked on every I/O, it's just a boot-time seatbelt. I'm not going to object to it's absence, and I might add it later if it doesn't happen.

alc added inline comments.
sys/kern/vfs_aio.c
2350–2351

Are you should that reordering vm_page_unhold_pages() and pmap_qremove() is universally safe? Consider powerpc{,64}. The powerpc{,64} pmaps create PVO entries for the unmanaged mappings created by pmap_qenter(). Moreover, pmap_page_is_mapped() doesn't distinguish between PVO entries representing managed versus unmanaged mappings. So, if the page is freed by vm_page_unhold_pages(), can't the KASSERT(!pmap_page_is_mapped()) fail?

kib marked an inline comment as done.Nov 28 2020, 10:42 PM
kib added inline comments.
sys/kern/vfs_aio.c
2350–2351

This is, in fact, a surprise for me. I thought that qenter() mappings are never accounted for as mappings for pmap_page_is_mapped().

The reordering was needed because job->pages is really pbuf->b_pages, and code freed pbuf before doing unwire().

D27409

This revision was automatically updated to reflect the committed changes.
Owners added a reviewer: Restricted Owners Package.Nov 29 2020, 8:45 AM