The swap pager (swapgeom at least) appears to be compatible with
direct dispatch. Flag the consumer we create as such. In addition,
decrement the in flight index when we have an out of memory
error. This would prevent the swap file from being able to be unloaded
since it would always be busy after that.
Details
Diff Detail
- Repository
- rS FreeBSD src repository - subversion
- Lint
Lint Passed - Unit
No Test Coverage - Build Status
Buildable 389 Build 389: arc lint + arc unit
Event Timeline
The leak fix looks fine. Theoretically there can be a race with simultaneous close, but it is probably too rare to bother much.
What's for direct dispatch support -- I know little about the swapper code, but if you say it is good -- I have no objections.
sys/vm/swap_pager.c | ||
---|---|---|
2607 | As I understand, the index field of g_consumer counts in-flight transactions. When index == 0 and private is not NULL, the consumer should be 'closed' (I am not sure that this is right term). This is not done there for case when index reaches 0. Also, wouldn't it me cleaner to initialize index with 1 and do not touch private at all, just changing the construct in swapgeom_orphan to cp->index--; destroy = cp->index == 0; ? | |
2715 | Note that swapgeom_strategy() uses sleepable bio allocation call g_alloc_bio() for read requests. I think this is incompatible with the direct dispatch, but I may be wrong. |
Other then the issue at line 2584, the rest looks fine.
sys/vm/swap_pager.c | ||
---|---|---|
2607 | This fragment still lacks handling of cp->index == 0. |
sys/vm/swap_pager.c | ||
---|---|---|
2607 | Why does this segment need cp->index == 0 handling? It should never be zero here because we can't start new I/O while closing. Se set sp->sw_id to NULL in swapgeom_orphan when we drop the last reference. This means that as soon as we get the lock in swapgeom_strategy, cp->index can't be zero when we increment it because sp->sw_id is not NULL. I can add a kassert to this effect befpre the cp->index++. That is, unless I've misunderstood something and you see some path I'm missing where this wouldn't be true. |
sys/vm/swap_pager.c | ||
---|---|---|
2607 | Because there could be in-flight bios while we are in _orphan() ? Otherwise, why would we need to handle the case of cp->index != 0 in orphan or cp->index == 0 in _done() ? |
sys/vm/swap_pager.c | ||
---|---|---|
2607 | Sounds like it would be better to check for == 0 before we increment it then, since we're closing... |
Add safety belt for entering swapgeom_strategy between the orphan call
and when it finally closes. I'm not sure it is possible, but I haven't
been able to prove it isn't.
More refinement based on feedback from kib. Make the acquire/release
semantics of indext clearer. Simplify logic a bit based on
understanding the race kib was worried about better.
sys/vm/swap_pager.c | ||
---|---|---|
2607 | That's a good idea. I understand the race you were worried about now. I didn't before. |
Minor last-minute tweaks. Fix a theoretical race setting sw_id, and combine two lines.
jhb preferences are to not put side-effects into conditions.
Anyway, new version is fine with me.
sys/vm/swap_pager.c | ||
---|---|---|
2555 | Heh, I hadn't realized my comments about side effects had made such a lasting impression. :) In general I do prefer to avoid conditions on side effects. It is more obvious with long if ((error = call_some_thing(lots_of_arguments, and_other + expressions, even_more_args, this_gets_long)) != 0) vs. error = call_some_thing(....); if (error != 0) (Although assigning the result of getopt() in the while condition is one time I always break this.) This one is probably fine, though I find that it requires the reader to exercise a few more brain cells (the -- is prefix, so does it happen before or after? oh, right before, so this will fire when going down to 0) whereas if you break it up you don't have to do that exercise in your head. It is plainly obvious what the order of the decrement is to the comparison: cp->index--; if (cp->index == 0) |
Thanks. I believe I somewhat understand the overall change and from what I understand it looks good to me.
sys/vm/swap_pager.c | ||
---|---|---|
2715 | I think think the point is semantics; if you advertise G_CF_DIRECT_RECEIVE, then you shouldn't do anything that could block in the I/O path. This provider will always be at the top of the stack, and like you say it'll do the alloc before entering GEOM, so maybe the right semantic thing here is to drop the RECEIVE flag. However, keep in mind that doing a sleepable alloc in strategy might backfire if the system is out of memory and the biozone zone is empty (something that can easily happen to us at netflix because we run so much i/o in parallel). We've talked about having a reserve pool of bio's specifically set aside in this code; there's probably no other way to safely handle this until we get better asynchronous queueing in GEOM. |
sys/vm/swap_pager.c | ||
---|---|---|
2715 | G_CF_RECEIVE on a consumer controls direct dispatch on the completion path. It is safe to do that. So I think that bit is OK. Am I misunderstanding the code in geom_io.c? I totally agree that a sleepable alloc may backfire. That's a problem in the current system. That's a weakness I hope to address, but not here. I think having a pool of bios set aside for when we can't allocate out of the normal pool is a good idea. Maybe even one or two chains on that other work I've done would suffice (we can allocate them statically, and then on the completion path have a special case to return the chain to a good state, possibly for the next I/O). This would limit the I/O we could do to one read and one right (in my thinking), but that would guarantee progress when there's no memory for a BIO which would solve the current worst-case scenario where we need to do I/O to recover memory. I'm not sure I understand the last bit, though. Why is better asynch queueing in GEOM required for us to safely do direct dispatch of down and up? I'm not sure I understand the point well enough since I'm having trouble drawing the connection. Can you clarify perhaps? |
sys/vm/swap_pager.c | ||
---|---|---|
2715 | The swap geom strategy does not sleep on the write, i.e. when the swap is the producer of the free memory. When b_iocmd is BIO_WRITE, g_new_bio() is used, which uses M_NOWAIT. When that change (to not sleep for bio allocation) in swap geom was introduced, M_NOWAIT combined the semantic of not sleepable allocation and hint to UMA that the request should drain the reservation of free pages deeper than other requests. This was later split into M_USE_RESERVE malloc flag, which is currently not used by g_new_bio(). I fully agree with the note that swap geom bio allocation should be allowed to drain reserve deeper, that the bio zone should have a reserve itself (but I do not see a good way to prevent the reserve drain from the non-priority bio allocations), and ultimately that the whole bio allocation on the write path is just architectural bug. |