Page MenuHomeFreeBSD

devctl: move to using a uma zone
ClosedPublic

Authored by imp on Aug 21 2020, 6:33 AM.
Tags
None
Referenced Files
Unknown Object (File)
Wed, Dec 25, 1:01 PM
Unknown Object (File)
Wed, Dec 4, 4:38 AM
Unknown Object (File)
Tue, Dec 3, 3:55 AM
Unknown Object (File)
Fri, Nov 29, 9:13 PM
Unknown Object (File)
Oct 31 2024, 12:30 AM
Unknown Object (File)
Oct 22 2024, 3:44 PM
Unknown Object (File)
Oct 10 2024, 7:23 AM
Unknown Object (File)
Oct 8 2024, 9:00 AM
Subscribers

Details

Summary

Convert the memory management of devctl. This is the first step in moving to a
pre-allocated system that makes it always possible to send the message. This
change keeps the interfaces the same, both external and internal, but they will
change in the future.

Rewrite devctl to lean into zones

Rewrite devctl to make better use of the zone. This eliminates several mallocs
(5? worse case) needed to send a message. now it's possible to always send a
message (with the oldest message on the queue reused for the newest message if
necessary).

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
imp added reviewers: mjg, markj.
cem added inline comments.
sys/kern/subr_bus.c
609 ↗(On Diff #76047)

(void) in C

655 ↗(On Diff #76047)

Generally not a fan of goto in macros

678 ↗(On Diff #76047)

I think it’d make more sense to just use sbuf into the fixedlen buf here.

820 ↗(On Diff #76047)

Enabling / disabling this at runtime is a bit weird imo.

844 ↗(On Diff #76047)

This doesn’t handle runtime increases of queue length. And as it’s a nofree zone I don’t think shrinking serves any purpose here. Probably should just produce an error on attempted shrink instead.

update to use sbuf, though in a bit of a naughty way.

sys/kern/subr_bus.c
609 ↗(On Diff #76047)

will fix.

678 ↗(On Diff #76047)

I did, but I had to be naughty and look inside the struct to do some things...

820 ↗(On Diff #76047)

Ah, right, this was pending. I'm going to make it a tunable at boot only. We've never needed this flexibility and it's getting in the way.

sys/kern/subr_bus.c
638 ↗(On Diff #76084)

I'd suggest sizeof(dei->dei_data) over DEVCTL_BUFFER here. Not a big deal.

690–691 ↗(On Diff #76084)

This is fine, but sbuf_finish() returns the same error and could be checked instead of sbuf_error. Definitely not a big deal.

722 ↗(On Diff #76084)

Ok, it is a little ugly but it kinda makes sense. Could also just allocate a second dei as a temporary intermediary buffer and add a copy.

I think the + 1 is basically wrong, though. In the sbuf implementation, it's for reserving a nul byte, even though sbuf only tracks length externally and doesn't add an explicit nul byte until sbuf_finish(). But the bus_* formatting methods are expected to include a nul byte in the provided buffer length, like strlcpy or snprintf.

If this is changed, the (freespace <= 0) checks should be changed to (freespace <= 1).

746–748 ↗(On Diff #76084)

I think we lost a space between these two.

764 ↗(On Diff #76084)

Could be sbuf_putc

The UMA usage looks ok to me. NOFREE is an unfortunately big hammer, but it's really the only one you have if you want to guarantee successful M_NOWAIT allocations. This topic has come up like three or four times in different contexts in the past month so it seems worth it to fix properly, i.e., provide a reliable mechanism that guarantees up to N simultaneous allocations from a reserved pool. uma_zone_reserve() gets part of the way there but has holes:

  • keg_drain() ignores the reservation (I'm testing a patch for this);
  • uma_zfree() doesn't replenish the reservation, so a subsequent alloc may not see the freed item even when the reservation is depleted (I have an idea on how to fix it).
sys/kern/subr_bus.c
419 ↗(On Diff #76084)

If you're fixing style anyway, the opening braces for the struct dev_event_info and struct dev_softc definitions should be on the preceding lines.

652 ↗(On Diff #76084)

Stray debug print.

820 ↗(On Diff #76047)

We also don't, for example, destroy devsoftc.mtx.

imp marked 2 inline comments as done.Aug 23 2020, 8:40 PM
imp added inline comments.
sys/kern/subr_bus.c
419 ↗(On Diff #76084)

Good idea. Done.

652 ↗(On Diff #76084)

I'll remove before commit.

820 ↗(On Diff #76047)

Good point. Will decide exactly what I support, and make sure those use cases work.

imp marked 2 inline comments as done.

handle destory now.

update zone creation when disable and to cope with the tunable case.

imp marked 6 inline comments as done.Aug 24 2020, 6:03 AM
imp added inline comments.
sys/kern/subr_bus.c
722 ↗(On Diff #76084)

fixed.

609 ↗(On Diff #76047)

Fixed after my last push...

820 ↗(On Diff #76047)

Actually, we shouldn't destroy the mtx if we disable devd. We still take the mutex before checking the queue length to see if we're disabled.

844 ↗(On Diff #76047)

Addressed this concern by making it only settable to a non-zero value as a tunable. If markj sorts out the UMA issues, I'll revisit.

imp marked an inline comment as done.Aug 25 2020, 9:45 PM

Ping? Are we good? Or have I missed something ?

sys/kern/subr_bus.c
416 ↗(On Diff #76139)

Extra newline.

629 ↗(On Diff #76139)

Extra newline.

681 ↗(On Diff #76139)

This return statement is redundant.

735 ↗(On Diff #76139)

sbuf_setpos() returns an error if pos > s->s_len, so I'm not sure that this is doing what you want. It looks like it's only supposed to be used for truncation.

imp marked an inline comment as done.Aug 26 2020, 3:33 PM

Will do another spin... sbuf isn't setup for 'callouts' like this :(

Though I think I'll fix the sbuf buf in setpos...

sys/kern/subr_bus.c
735 ↗(On Diff #76139)

Hmmm, you are correct. :(. What an unfortunate interface. I'd maintain it should test against s_size not s_len in that case. At least that's my reading of the man page.

The sbuf_setpos() function sets the sbuf's end position to pos, which is
a value between zero and one less than the size of the storage buffer.
This effectively truncates the sbuf at the new position.

I've fixed the code to reach into the sbuf struct for now...

sbuf API needs some improvements in this area to allow one to pass the sbuf buffer to other routines that aren't sbuf aware to dump their data into it...

imp added inline comments.
sys/kern/subr_bus.c
735 ↗(On Diff #76139)

talking to phk on this point, he relates that the code is correct, and the manual needs updating... quite simply because doing what I'm doing here wasn't contemplated in the sbuf design and is somewhat antithetical to it... He recommends fixing newbus to use sb... :) Likely not wrong, but not today.

sys/kern/subr_bus.c
735 ↗(On Diff #76139)

I agree that the manual is misleading and it would be nice to sbuf-ify the newbus APIs in question (eventually).

imp edited the summary of this revision. (Show Details)
  • Retire devctl_notify_f()

Update per phk's suggestions too

Looks ok to me.

sys/kern/subr_bus.c
677 ↗(On Diff #76267)

Consider adding a counter or rate-limited printf() for the overflow case.

4970 ↗(On Diff #76267)

I would assert (sb->s_flags & SBUF_INCLUDENUL) == 0 here and in the function above.

This revision is now accepted and ready to land.Aug 26 2020, 9:25 PM
sys/kern/subr_bus.c
4970 ↗(On Diff #76267)

why is that needed?
I'm not sure I understand the issue...

sys/kern/subr_bus.c
4970 ↗(On Diff #76267)

Oh! I understand now. Never mind...

This revision now requires review to proceed.Aug 26 2020, 10:38 PM

I think one key problem is that there are setups where non-"global" root can provoke event notification -- not only this poses question about proper sanitization of data, it poses the question what to do with flooding the device. For example jailed root can be allowed to mount/unmount filesystems which now triggers a HUGE message.

The current notion of "there is a message or not" is in my opinion wasteful. A circular buffer where you alloc upfront the bytes you need OR block waiting for more OR abort would work out better. In the simplest case you would populate the reserved area under the global lock. The current patch blindly requires 1KB per event (presumably motivated with the mount/unmount stuff) while most events are way shorter.

The format of space separated key=value is also problematic in that it is bigger than it needs to be, but that's perhaps for a different patchset.

In D26140#582298, @mjg wrote:

I think one key problem is that there are setups where non-"global" root can provoke event notification -- not only this poses question about proper sanitization of data, it poses the question what to do with flooding the device. For example jailed root can be allowed to mount/unmount filesystems which now triggers a HUGE message.

Ignore that. The mount messages need to be redone. But, these issues are more important. The buffers are a little larger to cope in the interim. The mount messages are wrong in a lot of other ways as well, and I need to puzzle through how best to cope (there's both too much information (some of it is bogus, some could be gleaned by other means) and too little (there needs to be some jail ID at least, for example). That's beyond the scope of this patch.

The current notion of "there is a message or not" is in my opinion wasteful. A circular buffer where you alloc upfront the bytes you need OR block waiting for more OR abort would work out better. In the simplest case you would populate the reserved area under the global lock. The current patch blindly requires 1KB per event (presumably motivated with the mount/unmount stuff) while most events are way shorter.

Yea, but the cost is only 1M, and likely more like 25k-50k once my rework is done. But one step at a time.

Also, the policy stuff about what to drop when needs some careful though. I've received feedback over the years that we need better low memory behavior, and it would be better to toss oldest message rather than newest if we can't allocate memory, which is what prompted this design. In a typical system, even when there's a flurry of events, there's not a huge demand since devd is good about keeping up... Though I'm taking to heart that some thought might be needed as we expand to different domains that exist on the system (aka jails) and we need to be mindful of starvation and other similar things...

For now, though, I think it's adequate to allow other things to make progress. Especially since the current cost is 1MB (tiny on modern systems, and also configurable at boot time if I'm wrong).

The format of space separated key=value is also problematic in that it is bigger than it needs to be, but that's perhaps for a different patchset.

That's not going to change in the near term. There's a lot of user-facing scripts that rely on it, and doing something different would require a lot of careful redesign which we don't have time to do before 13 branches. I have a number of cleanups that I'd like to do that would hide much of the key=value stuff from the in-kernel code. This might allow us to do something smarter in the future, but there's a lot of usage in the kernel that I'd like to clean up first.

tl;dr: you're not wrong in the long term, but in the short term I'd like to compromise to make progress towards the long-term goals.

sys/kern/subr_bus.c
4953 ↗(On Diff #76273)

Doh! Removed.

markj added inline comments.
sys/kern/subr_bus.c
4939 ↗(On Diff #76273)

MPASS I guess, and the semicolon is missing.

This revision is now accepted and ready to land.Aug 27 2020, 4:07 PM
jhb added inline comments.
sys/kern/subr_bus.c
4924 ↗(On Diff #76273)
4927 ↗(On Diff #76273)

I assume the longer term plan (which I think is a good one) is to convert the new-bus methods to take an sbuf instead of a (pointer, length) tuple?

4957 ↗(On Diff #76273)
sys/kern/subr_bus.c
4927 ↗(On Diff #76273)

Yes. (Or provide the latter as a wrapper around the former.)

sys/kern/subr_bus.c
4927 ↗(On Diff #76273)

Yup. There may also be some other APIs that I'll add to make this easier.

4939 ↗(On Diff #76273)

Already made that tweak locally.. here and below.

This revision now requires review to proceed.Aug 27 2020, 5:06 PM
In D26140#582304, @imp wrote:
In D26140#582298, @mjg wrote:

I think one key problem is that there are setups where non-"global" root can provoke event notification -- not only this poses question about proper sanitization of data, it poses the question what to do with flooding the device. For example jailed root can be allowed to mount/unmount filesystems which now triggers a HUGE message.

Ignore that. The mount messages need to be redone. But, these issues are more important. The buffers are a little larger to cope in the interim. The mount messages are wrong in a lot of other ways as well, and I need to puzzle through how best to cope (there's both too much information (some of it is bogus, some could be gleaned by other means) and too little (there needs to be some jail ID at least, for example). That's beyond the scope of this patch.

The current notion of "there is a message or not" is in my opinion wasteful. A circular buffer where you alloc upfront the bytes you need OR block waiting for more OR abort would work out better. In the simplest case you would populate the reserved area under the global lock. The current patch blindly requires 1KB per event (presumably motivated with the mount/unmount stuff) while most events are way shorter.

Yea, but the cost is only 1M, and likely more like 25k-50k once my rework is done. But one step at a time.

Also, the policy stuff about what to drop when needs some careful though. I've received feedback over the years that we need better low memory behavior, and it would be better to toss oldest message rather than newest if we can't allocate memory, which is what prompted this design. In a typical system, even when there's a flurry of events, there's not a huge demand since devd is good about keeping up... Though I'm taking to heart that some thought might be needed as we expand to different domains that exist on the system (aka jails) and we need to be mindful of starvation and other similar things...

So the question is what messages can be treated as optional to deliver and what messages (if any) as mandatory.

I just don't see any win from having a NOFREE zone for it, seems avoidably wasteful but if this is an intermediate state to be cleaned up soon(tm) it's probably fine. Still, I think there is a flag to disable per-CPU caching and it would be nice to use it here.

For now, though, I think it's adequate to allow other things to make progress. Especially since the current cost is 1MB (tiny on modern systems, and also configurable at boot time if I'm wrong).

The format of space separated key=value is also problematic in that it is bigger than it needs to be, but that's perhaps for a different patchset.

That's not going to change in the near term. There's a lot of user-facing scripts that rely on it, and doing something different would require a lot of careful redesign which we don't have time to do before 13 branches. I have a number of cleanups that I'd like to do that would hide much of the key=value stuff from the in-kernel code. This might allow us to do something smarter in the future, but there's a lot of usage in the kernel that I'd like to clean up first.

tl;dr: you're not wrong in the long term, but in the short term I'd like to compromise to make progress towards the long-term goals.

This revision was not accepted when it landed; it landed in state Needs Review.Aug 29 2020, 4:30 AM
This revision was automatically updated to reflect the committed changes.