fd: convert f_count to long
AbandonedPublic
Actions

Authored by mjg on Feb 2 2020, 11:37 AM.

Details

Reviewers

jeff
kib
markj
jhb

Summary

First, struct file size is 80 bytes. There is an unused field at the end added in 2005:

ommit d2c9006fb572d62b689561459965dc5b05674746
Author: rwatson <rwatson@FreeBSD.org>
Date:   Wed Feb 2 10:55:32 2005 +0000

    Add a place-holder f_label void * for a future struct label pointer
    required for the port of SELinux FLASK/TE to FreeBSD using the MAC
    Framework.

diff --git a/sys/sys/file.h b/sys/sys/file.h
index d3a9e9f2df1..febbf4d9d27 100644
--- a/sys/sys/file.h
+++ b/sys/sys/file.h
@@ -131,6 +131,7 @@ struct file {
        off_t   f_nextoff;      /*
                                 * offset of next expected read or write
                                 */
+       void    *f_label;       /* Place-holder for struct label pointer. */
 };
 
 #endif /* _KERNEL */

With this removed we can expand f_count from int to long while keeping the total size of 80 bytes (with 4 bytes of padding). Some fields can be made smaller but not enough to fit the extra 4 bytes.

Motivation for the change is 2 fold:

range changes were added to account for legitimate overflows. 32-bit should be more than big enough for 32-bit machines, while 64-bit is of course way more than necessary to aovid overflows
the refcount API grows extra features which not only add work for this consumer (which does not use them), but also reduce the range

With the switch to pure long-sized atomics full range is restored (and more) and there is less work to do.

Test Plan

tinderbox, stress2

Diff Detail

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

mjg created this revision.Feb 2 2020, 11:37 AM

Herald added subscribers: emaste, imp. · View Herald TranscriptFeb 2 2020, 11:37 AM

Harbormaster completed remote builds in B29109: Diff 67662.Feb 2 2020, 11:37 AM

mjg retitled this revision from fs: convert f_count to long to fd: convert f_count to long.Feb 2 2020, 11:41 AM

mjg edited the summary of this revision. (Show Details)

I agree with removal of f_label in principle, if it was not used for 5 years it can go away and re-added if a use appears. But this should be done in a separate review/commit, which perhaps should be more widely circulated. E.g. add rwatson as reviewer, and allow some time for him to time-out.

The overflow detecting was added to refcount for struct file. If struct file is not using it then the question becomes why detect overflow in the first place.

We do not want a bunch of implementations duplicating refcount all over the tree. If needed we can make refcount variants. refcount32, refcount64, refcountlong, whatever.

I would really prefer to see some benchmark that demonstrates that this matters at all before we invest time into it.

mjg mentioned this in D23480: Add new refcount API.Feb 3 2020, 3:17 PM

convert to the new API

In D23469#515011, @kib wrote:

I agree with removal of f_label in principle, if it was not used for 5 years it can go away and re-added if a use appears. But this should be done in a separate review/commit, which perhaps should be more widely circulated. E.g. add rwatson as reviewer, and allow some time for him to time-out.

No argument here. I posted this in one review to make it clear that at the end of the day the struct will not grow compared to completely unpatched kernels.

In D23469#515012, @jeff wrote:

The overflow detecting was added to refcount for struct file. If struct file is not using it then the question becomes why detect overflow in the first place.

It's unclear to me what should be done with it given that flags are present. Runtime checks can go if consumers are assumed to not go anywhere near the limit, but assertions would still have to check it. I have no interest in modifying this particular API.

We do not want a bunch of implementations duplicating refcount all over the tree. If needed we can make refcount variants. refcount32, refcount64, refcountlong, whatever.

I agree duplication is iffy. Given what I wanted to do for VFS I decided to abstract this away into an API which in my opinion works fine for both this (long sized f_count) and VFS use.

I would really prefer to see some benchmark that demonstrates that this matters at all before we invest time into it.

Unclear what you mean here. By default we should always avoid branches and memory references, unless it's a problem. The previous iteration of the patch was trivial to produce and did reduce branches. What required more effort was creating a dedicated API, which given the buggy use in VFS was more than justified. int-sized counts for vnodes are already pretty tight these days and one could argue this should switch to long as well. Shifting VFS to only use refcount(9) would require adding primitives to it and would still make it problematic to vn_printf called on assertion failure.

jeff added a reviewer: jhb.Feb 3 2020, 11:23 PM

You are creating work for everyone who has to review this and investigate as well as anyone who has to come after and figure out which of now three refcount apis are correct for their use and you have not yet justified the value of the change beyond pointing out that there is an extra branch. Arbitrarily small performance gains do not justify code complexity. Please demonstrate that this actually matters.

It helps me to take a step back and evaluate the current state of things.

Some background:

We had a couple of user-triggerable file ref overflow bugs in the unix domain socket code over the past year. This motivated me to add a generic saturated state to refcount(9) counters, which adds an extra conditional branch on the sign bit to _acquire and _release. I had file refcount overflows in mind, but there are other potential targets like ucreds that seem worth protecting.
kib added a checked refcount acquire operation for use in the file structure. This was motivated by the observations that 1) systems with large enough memory are susceptible to "legitimate" overflows caused by a user allocating 2^32 descriptors for a file, and 2) fget_unlocked() uses a plain atomic_fcmpset to acquire a reference and is thus not protected by the saturation checks. I think 2) is basically a non-issue now, refcount(9) has grown the appropriate KPIs.
The refcount(9) interface was extended to support some blocking operations needed for VM object PIP counters. In particular, the high bit of the counter is now reserved for a waiter flag.

Some observations/personal feelings:

I'm not sure about the int/long refcount split. How does a developer know which to choose? There is at least one situation (struct file) where we certainly must handle 32-bit overflows, but in general it seems hard to know whether it's necessary. What about ucred? We would have to look at all kernel resources which may hold a ucred reference. That said, I like the simplification that comes from not having to check fhold()'s return value, but I'm still not 100% convinced that it is really necessary to try and sanely handle the case (i.e., return EBADF) where a file ref count has overflowed _if_ we're also detecting saturation and we wrap all f_count operations using refcount(9).
It is not ideal that all refcount(9) consumers incur a cost to handle a waiter bit. I would like to see an alternate KPI that can be used for PIP, maybe barrier_* or so. This would make the saturation checks cheaper too. Has there ever been an objection to doing this? AFAIK refcount(9) was extended because that was simultaneously reasonable and expedient, and the gains from pushing down the object lock outweighed the extra overhead.
I think it is unnecessarily confusing to have refcount *and* refcntint/refcntlong. The listed motivations for the new refcnt* interfaces don't seem compelling to me. We can add some custom assertion support to refcount(9).

At this point I think I am ok with widening the reference counter for struct file, but comments like "I have no interest in modifying this particular API" are discouraging. I don't believe anyone finds the current state of things to be perfect, but creating new and slightly different KPIs to do the same thing just makes things worse.

In D23469#515584, @markj wrote:

It helps me to take a step back and evaluate the current state of things.

Some background:

We had a couple of user-triggerable file ref overflow bugs in the unix domain socket code over the past year. This motivated me to add a generic saturated state to refcount(9) counters, which adds an extra conditional branch on the sign bit to _acquire and _release. I had file refcount overflows in mind, but there are other potential targets like ucreds that seem worth protecting.

See below.

kib added a checked refcount acquire operation for use in the file structure. This was motivated by the observations that 1) systems with large enough memory are susceptible to "legitimate" overflows caused by a user allocating 2^32 descriptors for a file, and 2) fget_unlocked() uses a plain atomic_fcmpset to acquire a reference and is thus not protected by the saturation checks. I think 2) is basically a non-issue now, refcount(9) has grown the appropriate KPIs.

Side remark is that I already modified the code to use refcount_acquire_if_not_zero.

The refcount(9) interface was extended to support some blocking operations needed for VM object PIP counters. In particular, the high bit of the counter is now reserved for a waiter flag.

Some observations/personal feelings:

I'm not sure about the int/long refcount split. How does a developer know which to choose? There is at least one situation (struct file) where we certainly must handle 32-bit overflows, but in general it seems hard to know whether it's necessary. What about ucred? We would have to look at all kernel resources which may hold a ucred reference. That said, I like the simplification that comes from not having to check fhold()'s return value, but I'm still not 100% convinced that it is really necessary to try and sanely handle the case (i.e., return EBADF) where a file ref count has overflowed _if_ we're also detecting saturation and we wrap all f_count operations using refcount(9).

ucred is an excellent example of something which if in legitimate danger of not fitting in an int should use long. crhold() is called all over with no error checking and trying to add it is probably a futile effort. At the same time the struct comes with void *cr_pspare2[2]; meaning we can convert int to long without growing it, all while not creating unfreeable objects due to saturation limit being reached.

For something like file, which we can grow with no extra cost if we assume f_label removal amortizes, the problem similarly disappears.

An argument could be made that saturation checking can remain as a hardening feature in case of refcount leaks. I think that's valid and such a feature, if present, should be optional (compilation time),

It is not ideal that all refcount(9) consumers incur a cost to handle a waiter bit. I would like to see an alternate KPI that can be used for PIP, maybe barrier_* or so. This would make the saturation checks cheaper too. Has there ever been an objection to doing this? AFAIK refcount(9) was extended because that was simultaneously reasonable and expedient, and the gains from pushing down the object lock outweighed the extra overhead.

I think it is unnecessarily confusing to have refcount *and* refcntint/refcntlong. The listed motivations for the new refcnt* interfaces don't seem compelling to me. We can add some custom assertion support to refcount(9).

At this point I think I am ok with widening the reference counter for struct file, but comments like "I have no interest in modifying this particular API" are discouraging. I don't believe anyone finds the current state of things to be perfect, but creating new and slightly different KPIs to do the same thing just makes things worse.

But it's not strictly the same thing. Let me reiterate and expand where I'm coming from.

In the VFS layer there are 2 reference counters using the refcount API. Over time they started getting partially used with pure atomics, which on one hand was a stylistic issue of mixing different ways and on another allowed to easily dump the vnode should an assertion fail. This stylistic issue turned into a bug by introduction of flags in the counter, meaning this either has to start using the refcount API for all operations or move away from it entirely. I intend to add my own flags into one of the counters [1]. Thus reducing the range by refcount bits and vfs-specific bits sounds incredibly dodgy to me, putting the entire thing on a timer for a rewrite should any extra flags get added for either facility. With this and lack of custom assertions in the refcount API I was leaning towards vfs-local routines which do most of an equivalent work.

Apart from that I decided to do what I consider to be a clean up of the file refcount use, where I consider conversion to long to be overdue.

A fair point was made that this effectively repeats a lot of logic and if this is to be done at all, it should be dedupped. So I decided to roll with it and here we are.

Long-sized counter for struct file simplifies the code by removing error checking for reaching the limit. Since refcount API is int-only, I also moved it to pure atomics which also has a welcome (but not crucial) side effect of doing less work when manipulating the count.

The VFS layer needs to unify its counters on *something* and I consider an int with flags coming from refcount API too narrow given plans for custom bits. With the proposed patchset the layer is cleaned up in terms of mixed use, the code gets internally deduped (see vget_finish before and after for an example [D23481]) and also happens to have a very welcome (but not crucial) addition of shortening the generated code.

Another note is that we should strive to have functional, debug-friendly and lean core primitives and the last part means avoiding memory references and branches if it can be helped. Currently all operations add extra branches and releasing the last count (which happens all the time in vfs without freeing anything) induces a call to refcount_release_last, which as I understand wont go away even if flags get moved to another namespace. I presume it can be replaced with more branches at the call site, but the point stands -- if given consumer has no use for any of it (modulo optional hardening), it should not be paying for it. Otherwise we end up with small slowdowns everywhere which are hard to remove when people argue that the particular one does not stand out.

All the work done here (custom assertions and most notably type-safety) should be also expanded to cover the current refcount API. I think this is bad time to do it in that there is an unmerged patchset using it and the change would add avoidable churn. Instead, should someone want to sort this out, I think a new namespace should be added with appropriate types. One idea would be add a suffix indicating this variant of reference counting supports flags and/or saturation checks. However, given the reasoning above I consider changing refcount(9) itself to be orthogonal to the work at hand.

[1] (TL;DR when ->v_usecount transitions to 0 (which happens all the time) vnode interlock is taken to check if there is anything to do; vast majority of the time there is not but this results in lock contention. if filesystems get modified to denote with a bit in the count that there is work, this avoids the problem for the common case).

In D23469#520752, @mjg wrote:

In D23469#515584, @markj wrote:

Some observations/personal feelings:

I'm not sure about the int/long refcount split. How does a developer know which to choose? There is at least one situation (struct file) where we certainly must handle 32-bit overflows, but in general it seems hard to know whether it's necessary. What about ucred? We would have to look at all kernel resources which may hold a ucred reference. That said, I like the simplification that comes from not having to check fhold()'s return value, but I'm still not 100% convinced that it is really necessary to try and sanely handle the case (i.e., return EBADF) where a file ref count has overflowed _if_ we're also detecting saturation and we wrap all f_count operations using refcount(9).

ucred is an excellent example of something which if in legitimate danger of not fitting in an int should use long. crhold() is called all over with no error checking and trying to add it is probably a futile effort. At the same time the struct comes with void *cr_pspare2[2]; meaning we can convert int to long without growing it, all while not creating unfreeable objects due to saturation limit being reached.

The "if in legitimate danger" bit is part of my point. Is it a danger or not? For ucred I suspect not, but it cannot really be reasoned about precisely. So how do programmers know in general which width to choose?

For something like file, which we can grow with no extra cost if we assume f_label removal amortizes, the problem similarly disappears.

An argument could be made that saturation checking can remain as a hardening feature in case of refcount leaks. I think that's valid and such a feature, if present, should be optional (compilation time),

It is not ideal that all refcount(9) consumers incur a cost to handle a waiter bit. I would like to see an alternate KPI that can be used for PIP, maybe barrier_* or so. This would make the saturation checks cheaper too. Has there ever been an objection to doing this? AFAIK refcount(9) was extended because that was simultaneously reasonable and expedient, and the gains from pushing down the object lock outweighed the extra overhead.

I think it is unnecessarily confusing to have refcount *and* refcntint/refcntlong. The listed motivations for the new refcnt* interfaces don't seem compelling to me. We can add some custom assertion support to refcount(9).

At this point I think I am ok with widening the reference counter for struct file, but comments like "I have no interest in modifying this particular API" are discouraging. I don't believe anyone finds the current state of things to be perfect, but creating new and slightly different KPIs to do the same thing just makes things worse.

But it's not strictly the same thing. Let me reiterate and expand where I'm coming from.

In the VFS layer there are 2 reference counters using the refcount API. Over time they started getting partially used with pure atomics, which on one hand was a stylistic issue of mixing different ways and on another allowed to easily dump the vnode should an assertion fail. This stylistic issue turned into a bug by introduction of flags in the counter, meaning this either has to start using the refcount API for all operations or move away from it entirely. I intend to add my own flags into one of the counters [1]. Thus reducing the range by refcount bits and vfs-specific bits sounds incredibly dodgy to me, putting the entire thing on a timer for a rewrite should any extra flags get added for either facility. With this and lack of custom assertions in the refcount API I was leaning towards vfs-local routines which do most of an equivalent work.

Apart from that I decided to do what I consider to be a clean up of the file refcount use, where I consider conversion to long to be overdue.

A fair point was made that this effectively repeats a lot of logic and if this is to be done at all, it should be dedupped. So I decided to roll with it and here we are.

Long-sized counter for struct file simplifies the code by removing error checking for reaching the limit. Since refcount API is int-only, I also moved it to pure atomics which also has a welcome (but not crucial) side effect of doing less work when manipulating the count.

Sure, I guess having int and long refcounts is useful. But there should be a default "refcount" that gets used unless you know what you are doing. So what about having refcount_* and refcountl_*?

The VFS layer needs to unify its counters on *something* and I consider an int with flags coming from refcount API too narrow given plans for custom bits. With the proposed patchset the layer is cleaned up in terms of mixed use, the code gets internally deduped (see vget_finish before and after for an example [D23481]) and also happens to have a very welcome (but not crucial) addition of shortening the generated code.

Another note is that we should strive to have functional, debug-friendly and lean core primitives and the last part means avoiding memory references and branches if it can be helped. Currently all operations add extra branches and releasing the last count (which happens all the time in vfs without freeing anything) induces a call to refcount_release_last, which as I understand wont go away even if flags get moved to another namespace. I presume it can be replaced with more branches at the call site, but the point stands -- if given consumer has no use for any of it (modulo optional hardening), it should not be paying for it. Otherwise we end up with small slowdowns everywhere which are hard to remove when people argue that the particular one does not stand out.

Sure, but your patches do nothing to address that, which is really at the core of my complaint. I tend to agree with this reasoning though, at least regarding the waiter flag in refcount(9). I added an alternate interface for use by existing consumers of refcount_wait(), etc.: https://reviews.freebsd.org/D23723 . Comments would be appreciated.

All the work done here (custom assertions and most notably type-safety) should be also expanded to cover the current refcount API. I think this is bad time to do it in that there is an unmerged patchset using it and the change would add avoidable churn. Instead, should someone want to sort this out, I think a new namespace should be added with appropriate types. One idea would be add a suffix indicating this variant of reference counting supports flags and/or saturation checks. However, given the reasoning above I consider changing refcount(9) itself to be orthogonal to the work at hand.

The goals all sound reasonable, but I still don't get this reasoning for adding new interfaces that are 98% the same as an existing one. What unmerged patchset are you referring to?

[1] (TL;DR when ->v_usecount transitions to 0 (which happens all the time) vnode interlock is taken to check if there is anything to do; vast majority of the time there is not but this results in lock contention. if filesystems get modified to denote with a bit in the count that there is work, this avoids the problem for the common case).

Is this implemented somewhere?

In D23469#521103, @markj wrote:

In D23469#520752, @mjg wrote:

In D23469#515584, @markj wrote:

Some observations/personal feelings:

I'm not sure about the int/long refcount split. How does a developer know which to choose? There is at least one situation (struct file) where we certainly must handle 32-bit overflows, but in general it seems hard to know whether it's necessary. What about ucred? We would have to look at all kernel resources which may hold a ucred reference. That said, I like the simplification that comes from not having to check fhold()'s return value, but I'm still not 100% convinced that it is really necessary to try and sanely handle the case (i.e., return EBADF) where a file ref count has overflowed _if_ we're also detecting saturation and we wrap all f_count operations using refcount(9).

ucred is an excellent example of something which if in legitimate danger of not fitting in an int should use long. crhold() is called all over with no error checking and trying to add it is probably a futile effort. At the same time the struct comes with void *cr_pspare2[2]; meaning we can convert int to long without growing it, all while not creating unfreeable objects due to saturation limit being reached.

The "if in legitimate danger" bit is part of my point. Is it a danger or not? For ucred I suspect not, but it cannot really be reasoned about precisely. So how do programmers know in general which width to choose?

I tried to showcase that adding long - if needed - does not even necessarily grow the target struct. Maybe I was not clear enough, but I perfectly support "if in doubt use this" refcount type which would also do saturation checking. In fact, the current refcount API is what I had in mind to do it.

For something like file, which we can grow with no extra cost if we assume f_label removal amortizes, the problem similarly disappears.

An argument could be made that saturation checking can remain as a hardening feature in case of refcount leaks. I think that's valid and such a feature, if present, should be optional (compilation time),

It is not ideal that all refcount(9) consumers incur a cost to handle a waiter bit. I would like to see an alternate KPI that can be used for PIP, maybe barrier_* or so. This would make the saturation checks cheaper too. Has there ever been an objection to doing this? AFAIK refcount(9) was extended because that was simultaneously reasonable and expedient, and the gains from pushing down the object lock outweighed the extra overhead.

I think it is unnecessarily confusing to have refcount *and* refcntint/refcntlong. The listed motivations for the new refcnt* interfaces don't seem compelling to me. We can add some custom assertion support to refcount(9).

At this point I think I am ok with widening the reference counter for struct file, but comments like "I have no interest in modifying this particular API" are discouraging. I don't believe anyone finds the current state of things to be perfect, but creating new and slightly different KPIs to do the same thing just makes things worse.

But it's not strictly the same thing. Let me reiterate and expand where I'm coming from.

In the VFS layer there are 2 reference counters using the refcount API. Over time they started getting partially used with pure atomics, which on one hand was a stylistic issue of mixing different ways and on another allowed to easily dump the vnode should an assertion fail. This stylistic issue turned into a bug by introduction of flags in the counter, meaning this either has to start using the refcount API for all operations or move away from it entirely. I intend to add my own flags into one of the counters [1]. Thus reducing the range by refcount bits and vfs-specific bits sounds incredibly dodgy to me, putting the entire thing on a timer for a rewrite should any extra flags get added for either facility. With this and lack of custom assertions in the refcount API I was leaning towards vfs-local routines which do most of an equivalent work.

Apart from that I decided to do what I consider to be a clean up of the file refcount use, where I consider conversion to long to be overdue.

A fair point was made that this effectively repeats a lot of logic and if this is to be done at all, it should be dedupped. So I decided to roll with it and here we are.

Long-sized counter for struct file simplifies the code by removing error checking for reaching the limit. Since refcount API is int-only, I also moved it to pure atomics which also has a welcome (but not crucial) side effect of doing less work when manipulating the count.

Sure, I guess having int and long refcounts is useful. But there should be a default "refcount" that gets used unless you know what you are doing. So what about having refcount_* and refcountl_*?

See below.

The VFS layer needs to unify its counters on *something* and I consider an int with flags coming from refcount API too narrow given plans for custom bits. With the proposed patchset the layer is cleaned up in terms of mixed use, the code gets internally deduped (see vget_finish before and after for an example [D23481]) and also happens to have a very welcome (but not crucial) addition of shortening the generated code.

Another note is that we should strive to have functional, debug-friendly and lean core primitives and the last part means avoiding memory references and branches if it can be helped. Currently all operations add extra branches and releasing the last count (which happens all the time in vfs without freeing anything) induces a call to refcount_release_last, which as I understand wont go away even if flags get moved to another namespace. I presume it can be replaced with more branches at the call site, but the point stands -- if given consumer has no use for any of it (modulo optional hardening), it should not be paying for it. Otherwise we end up with small slowdowns everywhere which are hard to remove when people argue that the particular one does not stand out.

Sure, but your patches do nothing to address that, which is really at the core of my complaint. I tend to agree with this reasoning though, at least regarding the waiter flag in refcount(9). I added an alternate interface for use by existing consumers of refcount_wait(), etc.: https://reviews.freebsd.org/D23723 . Comments would be appreciated.

They do a lot in that they move VFS away from the current API into a new one which matches what I what I would like to see. Again, see below.

All the work done here (custom assertions and most notably type-safety) should be also expanded to cover the current refcount API. I think this is bad time to do it in that there is an unmerged patchset using it and the change would add avoidable churn. Instead, should someone want to sort this out, I think a new namespace should be added with appropriate types. One idea would be add a suffix indicating this variant of reference counting supports flags and/or saturation checks. However, given the reasoning above I consider changing refcount(9) itself to be orthogonal to the work at hand.

The goals all sound reasonable, but I still don't get this reasoning for adding new interfaces that are 98% the same as an existing one. What unmerged patchset are you referring to?

I meant the object_concurrency branch which presumably still has patches which would conflict should someone convert vm object et al counts into something else.

Perhaps I was not also clear enough, but at the time the waiter bit functionality was being introduced I suggested it is implemented somewhere else and that was met with rather strong opposition. Thus instead of suggesting moving it away I wrote new API for reason stated above (bits from ref + vfs make int really small).

With D23723 in place the landscape looks quite differently. I think we can stick to refcount and refcountl as proposed above, with the smaller variant also doing saturation checking. I would argue an int-sized variant which does not do any of that (refcount_raw?) would be prudent but I'm not going to insist on it in the light of the new blockcount API.

That said, I can rework my macros to do the trick for refcount namespace. There will be some surgery needed to allow consumers to transition without breaking compilation but nothing major.

[1] (TL;DR when ->v_usecount transitions to 0 (which happens all the time) vnode interlock is taken to check if there is anything to do; vast majority of the time there is not but this results in lock contention. if filesystems get modified to denote with a bit in the count that there is work, this avoids the problem for the common case).

Is this implemented somewhere?

Not yet. There are blockers I need to care of first.

mjg abandoned this revision.Mar 12 2024, 10:31 PM

Herald added a subscriber: linuxkpi. · View Herald TranscriptMar 12 2024, 10:31 PM